1.  ABSTRACT

Pattern  recognition  is  a statistical  technique  that  allows  one  to  find 
or  predict  a property  of  chemicals  that  is  not  directly  measurable,  but  is 
known  to  depend  upon  certain  features  or  properties  of  the  chemicals  via 
some  totally  unknown  relationship.  This  technique  has  been  applied  to  a 
multitude  of  scientific  problems.  The  same  technique  was  used  to  classify 
a chemical  according  to  its  relative  hazard  in  bulk  water-transportation 
based  on  chemical  structure  and  macro-scale  properties  such  as  density, 
vapor  pressure,  structure-fragments,  solubilities,  etc. 

Using the Linear-Learning Machine, the overall prediction for the 47
compounds in the training set was 68% correct. The predicted classifications
of the 240 compounds in the test set were approximately 68% correct. There
are  many  difficulties  associated  with  properly  classifying  compounds  on 
the  basis  of  variables  derived  from  structural  fragments  that  must  be 
solved  before  great  reliance  can  be  placed  on  the  results  of  a Linear- 
Learning  Machine  classification. 
2.  INTRODUCTION


The U. S. chemical industry transports the bulk of its chemical feedstocks
and products by water. This movement of large amounts of chemicals by
tankers, barges, etc., constitutes a definite fire, health and poison
hazard, and the chemicals are possible physiological irritants and water
pollutants if spills occur (1). The U. S. Coast Guard has primary
responsibility for the safety of shipping, waterways and citizens of this
nation. Some methods of assessing the hazards of these chemicals during bulk
transportation by water have been developed by various organizations (1-5)
for the U. S. Coast Guard. The National Research Council's Committee on
Hazardous Materials (Division of Chemistry and Chemical Technology) (1) has
issued a Tentative Guide for the Evaluation of the Hazard of Bulk Water
Transportation of Industrial Chemicals, which outlines a system of
evaluation. It also tentatively rates 337 common industrial chemicals.

The fire hazard aspects of this overall problem have deeply concerned
the Coast Guard because of the potential loss of life and ships and damage to
the environment. Underwriters' Laboratories, Inc. have tested 53 chemicals,
determining  the  flame  propagation  effects  and  pressure  piling  developed  by 
various  gas  and/or  vapor-air  mixtures  of  these  chemicals.  They  used  the 
Westerberg  Explosion  Test  Vessel  which  measures  flash  points,  ignition 
temperatures  and  flammability  limits  of  these  chemicals.  Using  this  data, 
the  Electrical  Hazards  Panel  of  the  Committee  on  Hazardous  Materials  (the 
Division  of  Chemistry  and  Chemical  Technology,  National  Research  Council) 
has  tentatively  classified  370  chemicals  as  to  their  relative  fire  hazard 
with  respect  to  explosion-proof  electrical  equipment.  These  tentative 
classifications  are  based  on  the  experimental  data  from  the  subset  of  53 
chemicals that have been tested by Underwriters' Laboratories, Inc. The
current  classifications  for  a large  number  (>200)  of  chemical  compounds 
are  essentially  an  educated  guess  by  a panel  of  experts. 
3.  RESEARCH  OBJECTIVES

The  overall  objective  of  the  research  conducted  under  this  contract 
was  to  utilize  pattern  recognition  techniques  to  develop  a computer  program 
that  would  quickly,  cheaply  and  effectively  evaluate  a chemical  compound  as 
to  its  fire-hazard  classification  for  bulk  water  transport.  The  information 
to  perform  this  classification  would  be  obtained  from  the  chemical  structure 
and other simple chemical-physical properties of the compound.

Chemometrics,  a growing  discipline  in  chemistry,  can  be  defined  as 
the  study  of  new  mathematical  and  statistical  approaches  to  solving  chemi- 
cal problems.  Pattern  recognition  (9),  a subset  of  these  mathematical 
methods,  has  recently  been  applied  to  many  chemical  problems.  Recent  re- 
views (8,9)  reference  much  of  the  literature  that  demonstrates  the  unique 
adaptations  of  pattern  recognition  techniques  to  solve  problems  in  chemistry. 
The  same  techniques  have  been  applied  herein  to  the  problem  of  classifying 
a chemical  according  to  its  relative  fire  hazard  during  bulk  water  trans- 
portation. 
4.  SOFTWARE  DEVELOPMENT

Almost 50% of our effort was spent on developing computer codes for use
on the fire hazard classification problem. A generalized factor analysis
program and 3-dimensional plotting and hierarchical clustering routines were
written for use on the IBM-370/155. These programs proved to be of marginal
use in solving the problem. A program named ARTHUR (12), written by Duewer,
Harper and Kowalski from the University of Washington and Fasching, Weisel
and Stromberg from the University of Rhode Island, was used to obtain the
data presented in this report. Appendices A and B explain in detail the
terms, behavior and capabilities of these programs.
5.  EVALUATION  OF  VARIOUS  VARIABLES

The objective of this examination was to find the smallest number of
variables  that  gave  complete  separation  of  the  compounds  in  the  training 
set  according  to  their  category.  The  variables  were  chosen  because  they 
give  distinguishing  characteristics  about  each  compound's  reactivity,  com- 
bustibility, vapor  pressure  and/or  some  other  property  which  might  con- 
tribute  to  its  fire  hazard.  All  the  variables  being  used  are  either  chemi- 
cal or  physical  values  that  are  measurable  quantities  as  either  distinct 
numbers  or  categorized  from  experimental  data.  Using  measured  quantities 
avoids  any  biases  that  would  result  from  data  that  is  subject  to  change  be- 
cause of  extrapolation  by  the  scientist.  Variables  for  which  no  experimental 
data  was  found  were  estimated  by  using  a range  of  various  values  based  on 
other  similar  compounds  and  general  trends  for  that  variable. 

The variable order (Table I) was determined using histograms (Fig. 1);
two features of the linear learning machines in ARTHUR, described below;
step-wise discriminant function routines in the BMDP package (12); and the
Fisher Weight, Variance Weight and Property Weight step-wise discriminant
features of the SELECT routine in ARTHUR (Appendix II). Step-wise
discriminant methods assume that one can separate out variables on the basis
of decreasing importance with respect to variance.
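
The idea of weighting variables by how well they separate categories can be
illustrated with a two-class Fisher ratio. This is only a sketch under stated
assumptions: the formula used here (squared difference of class means divided
by the sum of class variances) is the common textbook form and may differ from
the exact weights computed by the SELECT routine, and the function names are
hypothetical.

```python
from statistics import mean, pvariance


def fisher_weight(values_a, values_b):
    """Two-class Fisher ratio for a single variable (assumed textbook form)."""
    spread = pvariance(values_a) + pvariance(values_b)
    gap = (mean(values_a) - mean(values_b)) ** 2
    if spread == 0:
        # Both classes are constant: perfect separation or none at all.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_variables(class_a, class_b):
    """Return variable indices ordered by decreasing Fisher weight.

    class_a and class_b are lists of rows, one row of variable values
    per compound.
    """
    n_vars = len(class_a[0])
    weights = [
        fisher_weight([row[j] for row in class_a], [row[j] for row in class_b])
        for j in range(n_vars)
    ]
    return sorted(range(n_vars), key=lambda j: weights[j], reverse=True)
```

A variable whose class means are far apart relative to the scatter within
each class receives a large weight and is ranked first.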

The histograms allow us to determine which variable has the least
amount of association, i.e. is the most independent. Two features of the
linear learning machine were used to determine the efficiency of a set of
variables in predicting the results. First, the smaller the number of passes
made, the quicker the results converged. Second, if 100% separation was not
obtained, the machine showed which compounds were incorrectly classified.
The second feature is useful since an examination of the type of compounds
incorrectly classified in the training set shows whether too little or too
much emphasis is being placed on particular classes of compounds or
functional groups.

The results of our variable selection on the training set are listed in
Table V and Table VI. Table V shows that complete separation of the
compounds into the appropriate categories is obtained when all compounds are
included. Table VI is a compilation of the results of a JACKKNIFE study. The
JACKKNIFE procedure uses one of the known compounds as a test case and
compares the predicted value with the experimentally determined value. The
complete training set is treated in this fashion. When all of the compounds
are used, 100% separation into the correct categories is obtained. In
JACKKNIFE this was not the case. For example, when 2-nitropropane is used as
a test compound, and is therefore left out of the training set and not
included in the determination of the eigenvectors, it is incorrectly
associated with the D group. Since 2-nitropropane is the only compound in
the training set containing an NO2 group, it is probably that unique
characteristic that aids in the compound's classification when it is
included in the training set (Table V). 2-Nitropropane therefore seems able
to contribute useful information upon which decisions about other
NO2-containing compounds can then be made. Step-wise discriminant analysis
(described in Appendix II) determines the order of importance of variables
according to how well a variable retains the compounds in the appropriate
category.
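
The JACKKNIFE (leave-one-out) procedure described above can be sketched as
follows. The nearest-centroid classifier is only a stand-in chosen to keep the
example short; it is not the PLANE or MULTI routine, and all function names
are hypothetical.

```python
def nearest_centroid_train(samples, labels):
    """Stand-in classifier: store the mean vector of each category."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def nearest_centroid_predict(centroids, x):
    """Assign x to the category whose centroid is closest."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda y: dist2(centroids[y]))


def jackknife(samples, labels, train, predict):
    """Hold out each compound in turn, train on the rest, predict the held-out one."""
    results = []
    for i in range(len(samples)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        model = train(rest_x, rest_y)
        results.append((labels[i], predict(model, samples[i])))
    return results


def fraction_correct(results):
    """Share of held-out compounds whose prediction matches the known category."""
    return sum(1 for known, predicted in results if known == predicted) / len(results)
```

When a category or feature is represented by a single compound, holding that
compound out removes all information about it from the training set, which
mirrors the behavior reported for 2-nitropropane.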

The reasons these routines yielded the ordering found in Table I are
not very obvious, as the ordering depends on complex interactions among the
variables. The ordering is based on empirical results. A simple case of this
is that total chlorine and the number of chlorines attached to carbon seem
to have very different importance, as they are listed as variable numbers 17
and 39 respectively, although one would guess that they should have very
similar importance. Actually, in the data set chosen they are exactly
equivalent, i.e. all chlorines present are attached to carbons. The
step-wise discriminant program in the BMDP package and those in the ARTHUR
package, when presented with two variables that are equivalent, or one that
is a linear combination of another or of sets of others, recognize this fact
and essentially eliminate one of the variables by making its contribution to
the classifying function negligible, thereby eliminating duplicate
information.

In addition to eliminating duplicate information, differences in unit size
among variables were standardized by autoscaling to unit variance and zero
mean. This feature normalizes differences between variables due to units,
thus making it possible to compare values such as temperature, solubility
and number of functional groups present. It also makes it inconsequential
what units one chooses for measuring any variable, such as ignition
temperature, whether degrees Kelvin, centigrade or Fahrenheit, provided the
same unit is used for all compounds.
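
The unit independence that autoscaling provides can be checked with a short
sketch. The temperatures below are hypothetical values chosen for
illustration, and the population standard deviation is an assumption; ARTHUR
may use a different normalization constant.

```python
from statistics import mean, pstdev


def autoscale(column):
    """Scale one variable to zero mean and unit variance (population form)."""
    center = mean(column)
    spread = pstdev(column)
    if spread == 0:
        # A constant variable carries no discriminating information.
        return [0.0 for _ in column]
    return [(value - center) / spread for value in column]


# Hypothetical ignition temperatures for three compounds, in degrees centigrade.
centigrade = [200.0, 350.0, 500.0]
fahrenheit = [c * 9.0 / 5.0 + 32.0 for c in centigrade]
kelvin = [c + 273.15 for c in centigrade]

scaled_c = autoscale(centigrade)
scaled_f = autoscale(fahrenheit)
scaled_k = autoscale(kelvin)
```

Because the three temperature scales differ only by a shift and a positive
scale factor, the three autoscaled columns agree to within rounding error.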

Table IV lists the 13 variables that were found to give the optimal
results. An examination of Table IV shows that variables that contain
information distributed among all compounds, such as AIT, molecular weight
and solubilities, must be coupled with information about specific functional
groups such as epoxy, nitro and NH groups. We again wish to mention that if
any large family of compounds containing one type of functional group is
eventually going to be classified, it is important to have a few examples of
the family in the training set to see if that functional group contributes
to its classification or if its effects can be accounted for by other
variables present. The classifying function contains both variables that
have values for all compounds and variables for which only a few compounds
have a value other than zero, for the following reason: the first type of
variable sets up a basis where very general trends are found, such as high
molecular weight compounds being less hazardous than lighter ones. The
second type, functional groups, is needed since such groups can greatly
activate or deactivate the reactivity of a compound, thus shifting its
classification.
6.  CLASSIFICATION  PREDICTION

The variable list generated as described in the previous section was
utilized to predict the categories of unknowns. First, we tried to separate
the classified compounds into their appropriate categories and minimize the
number of variables needed for the optimal results. This was then checked
to estimate the accuracy of our prediction and then was applied to unknowns.
A discussion of the procedures used to accomplish these steps follows.

All of the compounds that have been experimentally classified into NAS
groups B, C, and D at temperatures of less than 25°C were used in this
research. (These are listed in Table V.) The two compounds classified into
the A group, acetylene and carbon disulfide, are included in category B.
This produces a group of compounds with hazard classification B or above.
The reason for this is that only two compounds do not provide an adequate
mathematical basis to establish a legitimate pattern, to distinguish which
variables are instrumental in classifying compounds, or to form a base for
predicting unknown compounds. Leaving the two compounds out completely does
not markedly affect the B group.

As mentioned above, only compounds classified experimentally below 25°C
are being utilized. Those whose data were obtained above 25°C are not being
included since their classification at the normalized temperature may not
be the same and they would therefore contribute inaccurate information to
the list of variable values for those compounds. The methyl
acetylene-propadiene (MAPP) gas mixture and gasoline mixtures are also not
being used, since unique chemical and physical properties cannot be defined
for these substances. It should be noted that the classifying method we are
attempting to develop cannot be used for any mixtures or heterogeneous
substances.

The BMDP program used requires that a priority of weighting be set
between the categories. This value should reflect both the suspected "cost"
of misclassifying a compound that actually belongs to group B into group C,
B into D, etc., and the probability of a randomly selected compound being
either a B, C, or D. A number of different priorities were used, with the
ratio of B to C to D of 1 to 2 to 1 most prevalent. This ratio was
calculated by estimating the cost of misclassification of a B into a C and
of a C into a D category at five times that of the opposite error (John M.
Cece, private communication, 1976), since it is thought that in the case of
an accident the inadequate safeguards would cause a higher loss of both life
and material than the expense of using a higher safeguard system for a less
dangerous compound. The probability of random selection was calculated by
dividing the total number of compounds in each category by the total number
of compounds listed in the report entitled "Matrix of Electrical and Fire
Hazard Properties and Classification of Chemicals" by the Committee on
Hazardous Materials. It was assumed that this report contained a large,
randomly selected sample of chemicals that will be shipped and that a
reasonable number of the assigned categories are correct (~75%).

Table V is a list of the 47 chemical compounds that were used as a
training set for the various pattern recognition techniques. Only compounds
that were not experimentally tested at elevated temperatures were included.

A training set is a group of compounds whose classifications the programs
assume to be correct; it is used to train the program to predict the
classification of test samples. The two principal learning machines that
were finally used to solve the fire hazard classification problem were PLANE
and MULTI. The programs can train themselves to be 100% correct with respect
to the training set (Appendix II).
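
How a linear learning machine trains itself on a training set can be sketched
with a simple two-category error-correction loop. This is an illustrative
perceptron-style rule, not the PLANE or MULTI code from ARTHUR, whose
correction step may differ; the function names and the +1/-1 label encoding
are assumptions.

```python
def train_linear_machine(samples, labels, max_passes=1000):
    """Train a two-category linear machine by error correction.

    samples: list of feature vectors (ideally autoscaled).
    labels: +1 or -1 for each sample.
    Returns (weights, passes_used, converged).
    """
    weights = [0.0] * (len(samples[0]) + 1)  # last weight acts as the constant
    for n_pass in range(1, max_passes + 1):
        errors = 0
        for x, y in zip(samples, labels):
            augmented = list(x) + [1.0]
            score = sum(w * v for w, v in zip(weights, augmented))
            if y * score <= 0:
                # Misclassified: move the decision plane toward this sample.
                weights = [w + y * v for w, v in zip(weights, augmented)]
                errors += 1
        if errors == 0:
            return weights, n_pass, True
    return weights, max_passes, False


def classify(weights, x):
    """Return +1 or -1 according to the side of the decision plane."""
    score = sum(w * v for w, v in zip(weights, list(x) + [1.0]))
    return 1 if score > 0 else -1
```

For a linearly separable training set the loop stops after a pass with no
errors, which corresponds to 100% recognition of the training set; the number
of passes needed is the convergence measure mentioned in Section 5.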

Table VI is a list of the same training set shown in Table V, but
each compound was in turn considered to be the test set. Forty-seven
computer runs were performed using the JACKKNIFE procedure described
previously, with the other 46 chemicals being the training set. The
results of these experiments are summarized in Table VII. The programs
PLANE and MULTI correctly predict the training set ~70% of the time when each
compound in the training set is considered as a test compound.

A summary of the compounds that were misclassified is given, by
category, in Table VII. Category D was predicted at a reliability between
74% and 79%, category C at 67% and category B at 43%. A possible explanation
for the misclassifications of the experimentally determined compounds is
derived from their spark gap values. Spark gap values of 0.010" and 0.030"
are the apparent divisions between the B and C categories and the C and D
categories respectively. There are a few compounds that are classified in a
category other than would be suggested by these cutoffs due to their
anomalously high pressure piling values (>250 psig). We have predicted a
number of these compounds to be different from the assigned category. They
are ethylene diamine, vinyl chloride, cyclopropane, 1,3 butadiene and
propylene. An additional number of compounds within 0.005" of the spark gap
boundaries are also misclassified (Table III). Two compounds with whose
category we have agreed have spark gap values on the "wrong" side of the
boundary. They are isoprene, classified as a C with a spark gap value of
0.037", and acrolein, classified as a B with a spark gap value of 0.018". It
seems probable, therefore, that some of the problems we have had could be
due to the boundaries between categories not being clear cut and pressure
piling values only being considered when they are very large.
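
The spark gap cutoffs discussed above can be written as a small rule to show
where the two "wrong side" compounds fall. This is a sketch only: the
treatment of values exactly on a boundary is an assumption, category A is
folded into B as in the training set, and pressure piling is deliberately
ignored.

```python
# Tentative spark gap boundaries in inches, as quoted in the text.
B_C_BOUNDARY = 0.010
C_D_BOUNDARY = 0.030


def category_from_spark_gap(gap_inches):
    """Category suggested by the spark gap alone (boundary handling assumed)."""
    if gap_inches < B_C_BOUNDARY:
        return "B"
    if gap_inches <= C_D_BOUNDARY:
        return "C"
    return "D"


def is_borderline(gap_inches, tolerance=0.005):
    """True when the spark gap lies within the tolerance of either boundary."""
    slack = tolerance + 1e-12  # guard against floating point rounding
    return any(abs(gap_inches - b) <= slack for b in (B_C_BOUNDARY, C_D_BOUNDARY))


# Spark gap values quoted in the text for the two "wrong side" compounds.
isoprene_by_gap = category_from_spark_gap(0.037)  # assigned category is C
acrolein_by_gap = category_from_spark_gap(0.018)  # assigned category is B
```

By spark gap alone, isoprene would fall in D and acrolein in C, which is why
the text describes them as lying on the "wrong" side of the boundaries.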

Three  of  the  misclassified  compounds  in  the  training  set  (hydrogen, 
hydrogen  sulfide  and  carbon  disulfide)  are  the  only  inorganic  compounds 
present.  Since  organic  and  inorganic  compounds  do  not  react  the  same  way 
chemically,  it  is  possible  that  these  were  placed  in  an  incorrect  category 
since  we  based  their  prediction  on  information  derived  from  organic  com- 
pounds. 
The size of the training set and the uniqueness of certain
characteristics present in the chemical being tested may result in incorrect
classification. If a characteristic is not present in any compounds
remaining in the training set yet contributes greatly to the fire hazard of
the chemical tested, an incorrect result would occur. The classification
would therefore be based on other features of the compound because there is
no information about that unique feature.

Table IX lists the test compounds that have been classified into a
higher category than in current use (5). Alcohols, long chain alkanes,
alkenes and benzene substituted compounds comprise a large portion of this
table. A possible explanation for our higher classification of these four
types of compounds is insufficient data, because similar compounds are not
found in the training set, and/or because training set compounds in these
groups are being classified in the higher category.

7.  CONCLUSIONS

The results for the training set of 47 compounds are summarized in Table
VII. The two linear learning machines called PLANE and MULTI were used to
classify the compounds. The predicted classifications in Table VII consider
each chemical to be a test sample, and the routines are trained on the
other 46 chemicals. This set of analyses gives an overall estimate of 68%
correct classification. Table VIII lists the results of PLANE and MULTI
when the 47 compounds are used as the training set and the 240 compounds
are used as a test set. The predicted classifications are approximately
68% correct. Almost all long chain and short chain alcohols are classified
one category higher by both routines. Table IX lists the compounds that
have been classified upwards one category.

The  final  results  of  this  research  are  encouraging  and  they  have 
definitely  indicated  the  direction  of  future  research. 


8.  RECOMMENDATIONS

The difficulties of properly classifying compounds on the basis of
variables derived from structural fragments have been well documented in
this report and in the current literature.

It is our opinion that these results could be improved if the following
four areas were given further consideration:

1.  Use pattern recognition techniques based on a property descriptor
instead of classes (A, B, C, D). We can predict measured
experimental variables such as AIT, pressure piling, etc., much
better than abstract classifications that are based on these
variables. We would also have the ability to accurately check
our results using this approach, and scientists would make the
final evaluation.

2.  Explore  the  use  of  general  molecular  descriptors,  thermodynamic 
properties  and  electron  density  parameters  as  variables  in  the 
pattern  recognition  techniques.  Such  variables  would  be  obtained 
from  molecular  orbital  programs,  CHETAH  etc.  and  used  in  this 
work  to  improve  the  prediction  capability. 

3.  Devise  better  feature  extraction  procedures  and  transformations 
to  eliminate  the  random  or  noise  components.  A major  problem 
in  pattern  recognition  concerns  the  mathematical  ability  to 
separate  noise  from  real  and  useful  information  in  a variable. 
Better  techniques  for  this  problem  must  be  found. 

4.  Design a better learning machine program. The learning machine
has many faults but one great advantage: it is extremely simple
to use after it is trained. In fact, classification can be done on a
calculator. Its major problem (accuracy) has been well
documented in this report. If program modifications could be
developed to solve these problems, the pattern recognition field
would make a major leap forward.
TABLE I

Variable Ordering

 1.  Auto ignition temperature
 2.  Total number of hydrogens
 3.  Epoxy groups
 4.  NO2 groups
 5.  Molecular weight
 6.  Solubility in ether
 7.  CH3 groups
 8.  NH2 groups
 9.  Total number of carbons
10.  Carbon-carbon single bonds
11.  Carbon-carbon triple bonds
12.  Ether linkages
13.  Total number of sulfurs
14.  NH groups
15.  Solubility in alcohol
16.  Carbon-chlorine bonds
17.  NH3 groups
18.  Carbons without hydrogens
19.  Carbon-nitrogen triple bonds
20.  CH groups
21.  N-C=N groups
22.  Ethyl groups
23.  COH groups
24.  Total number of nitrogens
25.  Hydrogens alpha to C=O
26.  Ester linkages
27.  Nitrogens without hydrogens
28.  Solubility in water
29.  Hydrogens alpha to C=C
30.  Flash point
31.  Total number of oxygens
32.  Boiling point
33.  HC=O groups
34.  Melting point
35.  CH2 groups
36.  COOH groups
37.  C=O groups
38.  Carbon-carbon double bonds
39.  Total number of chlorines


TABLE II

CLASSIFICATION FUNCTIONS(a)

VARIABLE #         GROUP D        GROUP C        GROUP B

 1  X(1)           7.13600        4.97641        2.76057
 2  X(2)          -0.36106       -0.43583       -0.40036
 4  X(4)          -0.06152        0.12287        0.10499
 5  X(5)          20.51698        8.62925       10.51797
 6  X(6)          12.02098       20.64725       18.11748
 7  X(7)           6.67528        6.92967        3.76196
 8  X(8)           4.35670       -6.77126       -6.94182
10  X(10)        -17.96806      -31.78358      -26.15622
11  X(11)        -37.05293      -22.64537       -7.63201
14  X(14)         53.06761      115.01823      112.80058
15  X(15)        -31.01567      -25.29327      -15.60228
16  X(16)        -26.66524       11.50037       29.91949
17  X(17)        -21.73500       53.08704       61.99918
19  X(19)         -3.70266        5.61196       12.42352
20  X(20)        -47.82832       56.20058       58.74736
21  X(21)        -48.94760       33.24651       30.70747
23  X(23)        -53.36461     -145.65991      -10?.30605
24  X(24)        -86.22984     -109.46097      -89.71614
26  X(26)       -146.34658     -104.83170      -70.20551
27  X(27)        -43.60481     -113.88313      -73.73346
30  X(30)        -25.06001      -14.32377       -3.85582
31  X(31)        -66.89027      -13.61657        6.76443
32  X(32)        -13.45770      -13.57369        2.25790
33  X(33)       -182.14958     -114.14978      -40.13753
34  X(34)        -54.96423      -50.60464      -66.21277
35  X(35)        -85.81082       23.76089       55.32253
36  X(36)          0.45324        0.41791        0.42203
37  X(37)         -4.26777       -9.11450       -6.22199
38  X(38)         -3.09302       -4.28917       -2.83177
39  X(39)        -33.49681      -17.37549        6.43740
CONSTANT        -224.74899     -199.83459     -188.00497

(a) Each numerical value in the table is the coefficient of a linear
polynomial of the 39 variables plus the constant. For example,
y = a0 + a1*X(1) + a2*X(2) + a3*X(3) + ... The calculated values of y for
each compound can be used to determine the appropriate classification.
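
Evaluating the classification functions of Table II can be sketched as
follows. Only the constant and the coefficients of X(1) and X(2) are copied
from the table, so the numbers produced are not real classifications; the
input values are hypothetical, and assigning the group with the largest y is
an assumption based on the usual convention for linear classification
functions rather than something the table states.

```python
# Truncated illustration: constant plus the X(1) and X(2) coefficients only.
FUNCTIONS = {
    "D": {"constant": -224.74899, "coef": {1: 7.13600, 2: -0.36106}},
    "C": {"constant": -199.83459, "coef": {1: 4.97641, 2: -0.43583}},
    "B": {"constant": -188.00497, "coef": {1: 2.76057, 2: -0.40036}},
}


def evaluate(functions, x):
    """Compute y = a0 + sum(a_i * X(i)) for every group."""
    return {
        group: f["constant"] + sum(a * x[i] for i, a in f["coef"].items())
        for group, f in functions.items()
    }


def classify(functions, x):
    """Pick the group with the largest y (assumed decision rule)."""
    scores = evaluate(functions, x)
    return max(scores, key=scores.get)


# Hypothetical variable values keyed by variable number.
example = {1: 20.0, 2: 0.0}
predicted_group = classify(FUNCTIONS, example)
```

A full calculation would include every variable listed in Table II, with each
X(i) expressed on the same scale that was used when the functions were
derived.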
TABLE III

                      NAS             Spark Gap(a)  Pressure Piling  Borderline(b)
Compound              Classification  (inches)      (lbs/sq. in.)    Compounds

Methane               D               .044           77
Ethylene Diamine      D               .029           82              *
Ethylamine            D               .039           65
Styrene               D               .037          133
Vinyl Acetate         D               .041          128
Vinyl Chloride        D               .029          165              *
Allyl Alcohol         C               .026          120              *
Epichlorohydrin       C               .022          149
Hydrogen Sulfide      C               .026           60              *
2-Nitropropane        C               .021          130
Triethylamine         C               .021          125
Cyclopropane          C               .034          147              *
Methyl Acetylene      C               .025          185              *
Ethylene              C               .027          180              *
1,3 Butadiene         B               .031          260              *
Carbon Disulfide      B               .002          205
Propylene Oxide       B               .021          280              *
Hydrogen              B               .003          845

(a) Spark gap tentative standards are less than 0.010" for A and B, between
0.010" and 0.030" for C, and greater than 0.030" for D.

(b) Compounds with spark gaps within ±0.005" of the tentative standards are
marked in this column.



TABLE IV

VARIABLES USED

 (1)  Molecular weight
 (2)  Solubility in ether
 (3)  Solubility in alcohol
 (4)  CH3 groups
 (5)  Carbon-carbon single bonds
 (6)  NH groups
 (7)  NH2 groups
 (8)  NO2 groups
 (9)  Ester linkages
(10)  Total number of carbons
(11)  Total number of hydrogens
(12)  Auto ignition temperature
(13)  Epoxy groups


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TABLE  V 
TRAINING  SET 


COMPOUNDS 

NAS 

MULTI 

PLANE** 

Allyl  Alcohol 

C 

C 

C 

Acrolein 

B 

B 

B 

sec-Butyl  Alcohol 

D 

D 

D 

n-Butyl  Aldehyde 

C 

C 

C 

Crotonaldehyde 

C 

C 

C 

Diethylamine 

C 

C 

C 

Di isobutylene 

D 

D 

D 

Epichlorohydrin 

C 

C 

C 

Ethyl  Acrylate 

D 

D 

D 

Ethylene  Diamine 

D 

D 

D 

Ethyleneimi ne 

C 

C 

C 

Hydrogen  Sulfide 

C 

C 

C 

Isopropyl  Ether 

D 

D 

D 

Mesityl  Oxide 

D 

D 

D 

Morpholine 

C 

C 

C 

2-Nitropropane 

C 

C 

C 

Pyridine 

D 

D 

D 

Tetrahydrofuran 

C 

C 

C 

Methane 

D 

D 

D 

Methyl  Formal 

C 

C 

C 

Dimethyl  Ether 

C 

C 

C 

Di-n-Propyl  Ether 

C 

C 

C 

Ethyl  amine 

D 

D 

D 

Tri ethyl  ami  ne 

C 

C 

C 

Cyclopropane 

C 

C 

C 

Methyl  Acetylene 

C 

C 

C 

Propane 

D 

D 

D 

Acetaldehyde 

C 

C 

C 

Acrylonitri le 

D 

D 

D 

Ammonia 

D 

D 

D 

criminant  program  in  the  BMDP  package  and  those  in  the  ARTHUR  package 
when  presented  with  two  variables  that  are  equivalent  or  one  that  is  a 


X / 
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Table  V cont'd. 


COMPOUNDS 

NAS 

MULTI 

PLANE** 

1,3  Butadiene 

B 

B 

B 

Carbon  Disulfide 

B* 

B 

B 

Ethylene  Dichloride 

D 

D 

D 

Ethylene  Oxide 

B 

B 

B 

Isoprene 

C 

C 

C 

Propylene 

D 

D 

D 

Propylene  Oxide 

B 

B 

B 

Styrene 

D 

D 

D 

Unsymmetric  Dimethyl  - 
Hydrazine  (UDMH) 

C 

C 

C 

Vinyl  Acetate 

D 

D 

D 

Vinyl  Chloride 

D 

D 

D 

Para-Xylene 

D 

D 

D 

Hydrogen 

B 

B 

B 

Diethyl  Ether 

C 

C 

C 

Ethylene 

C

C 

C 

Butane 

D 

D 

D 

Acetylene

B*

B

B

*Compounds in A category grouped with the B category.

**PLANE decides between two categories.


ing  its  classification. 
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TABLE  VI 

TRAINING  SET  CLASSIFICATIONS 


COMPOUNDS 

NAS 

MULTI 

PLANE** 

Allyl  Alcohol 

C 

C 

DC 

Acrolein 

B 

B 

BC 

sec-Butyl  Alcohol 

D 

D 

Dc 

n-Butyl  Aldehyde 

C 

C 

CD 

Crotonaldehyde 

C 

C 

CD 

Diethylamine

C 

C 

CD 

Diisobutylene

D 

D 

DC 

Epichlorohydrin

C

B

BD

Ethyl  Acrylate 

D 

D 

DC 

Ethylene  Diamine 

D 

B 

BC 

Ethyleneimine

C 

C 

CB 

Hydrogen  Sulfide 

C 

B

Bc 

Isopropyl  Ether 

D 

D 

DC 

Mesityl  Oxide 

D 

D 

DC 

Morpholine 

C 

C 

CD 

2-Nitropropane 

C 

D 

DC

Pyridine 

D 

D 

DC 

Tetrahydrofuran 

C

C 

CD 

Methane 

D 

C 

CD 

Methyl  Formal 

C 

C

CB

Dimethyl  Ether

C

C

CD

Di-n-Propyl  Ether

C

C

CD

Ethylamine

D

C

CD

Triethylamine

C 

B 

DC 

tween  the  categories.  This  value  should  reflect  both  the  suspected  “cost 
of  the  misclassifying  of  a compound  that  actually  belongs  to  the  group  B 
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TABLE  VI  cont'd. 

COMPOUNDS 

NAS 

MULTI 

PLANE** 

Cyclopropane  C 

Methyl  Acetylene  C 

Propane  D 

Acetaldehyde  C 

Acrylonitrile  D

Ammonia  D

1,3  Butadiene  B

Carbon  Disulfide  B* 

Ethylene  Dichloride  D 

Ethylene  Oxide  B 

Isoprene  C 

Propylene  D 

Propylene  Oxide  B 

Styrene  D 

Unsymmetric  Dimethyl  C 

Hydrazine  (UDMH) 

Vinyl  Acetate  D 

Vinyl  Chloride  D 

Para-Xylene  D

Hydrogen  B


D 

D 

D 

C 

D 

D 

C 

C 

D 

B 

C 

D 

C 

B 

C 

D 

D 

D 

C 


Unclassified*** 

CB 


Diethyl  Ether  C 

Ethylene  C 

Butane  D 

Acetylene  B* 


C 

B 

D 

B 


C 

B 

D 

B 


D 

C 

C 

C 


*Compounds in A category grouped with the B category.

**PLANE is a two-category classifier; the subscript is PLANE's choice between the two categories it did not originally select.

***PLANE,  which  examines  only  two  groups  at  a time,  did  not  give  a unique  answer 
for  all  three  pairs. 


described  previously  with  the  46  chemicals  being  the  training  set.  The
results  of  these  experiments  are  summarized  in  Table  VII.  The  programs
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TABLE  VII 

INCORRECTLY CLASSIFIED


MULTI (a)

Category D    15/19 = 79%
    Methane (C)
    Ethylene Diamine (B)
    Ethylamine (C)
    Styrene (B)

Category C    14/21 = 67%
    Epichlorohydrin (B)
    Hydrogen Sulfide (B)
    2-Nitropropane (D)
    Triethylamine (B)
    Cyclopropane (D)
    Methyl Acetylene (D)
    Ethylene (B)

Category B    3/7 = 43%
    1,3 Butadiene (C)
    Carbon Disulfide (C)
    Propylene Oxide (C)
    Hydrogen (C)

Total    32/47 = 68%


PLANE (a)

Category D    14/19 = 74%
    Methane (C)
    Ethylene Diamine (B)
    Ethylamine (C)
    Vinyl Acetate (C)
    Vinyl Chloride (C)

Category C    14/21 = 67%
    Allyl Alcohol (D)
    Epichlorohydrin (B)
    Hydrogen Sulfide (B)
    2-Nitropropane (D)
    Triethylamine (D)
    Methyl Acetylene (D)
    Ethylene (B)

Category B    3/7 = 43%
    1,3 Butadiene (C)
    Carbon Disulfide (C)
    Propylene Oxide (C)
    Hydrogen (C)

Total    31/47 = 66%


(a) Appendix II
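
The percentages in Table VII can be checked from the counts alone. The short sketch below recomputes them; the dictionary and function names are illustrative only and do not come from the ARTHUR package.

```python
# Counts transcribed from Table VII: (correctly classified, total) per NAS category.
MULTI_COUNTS = {"D": (15, 19), "C": (14, 21), "B": (3, 7)}
PLANE_COUNTS = {"D": (14, 19), "C": (14, 21), "B": (3, 7)}


def summarize(counts):
    """Return per-category percent correct, total correct, total size, overall percent."""
    per_category = {cat: round(100 * ok / n) for cat, (ok, n) in counts.items()}
    total_ok = sum(ok for ok, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    return per_category, total_ok, total_n, round(100 * total_ok / total_n)


if __name__ == "__main__":
    for name, counts in (("MULTI", MULTI_COUNTS), ("PLANE", PLANE_COUNTS)):
        print(name, summarize(counts))
```

The per-category rates show that most of the error is concentrated in the small B category (7 compounds), which the overall figure alone does not reveal.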


TABLE  VIII 

Test  Compounds  taken  from  "Matrix  of  Electrical  and 
Fire  Hazard  Properties  and  Classification  of  Chemicals" 


Number 

Compound  Name 

Classification 

NAS 

MULTI* 

PLANE* 

1 

Formic  Acid 

D 

D 

DC 

2 

Acetic  Acid 

D 

D 

DC 

3 

Propionic  Acid 

D 

D 

DC 

4 

n-Butyric  Acid 

D 

D 

DC 

5 

Acrylic  Acid* 

D 

C 

D-C 

6 

Acetic  Anhydride 

D 

D 

DC 

7 

Propionic  Anhydride 

D 

D 

Unclassified** 

8 

Phthalic  Anhydride 

D 

D 

Unclassified** 

9 

Methyl  Alcohol 

D 

C 

DC 

10 

Ethyl  Alcohol 

D 

C 

CD 

11 

n-Propyl  Alcohol 

D 

D 

Dc 

12 

iso-Propyl  Alcohol 

D 

C 

CD 

13 

n-Butyl  Alcohol 

D 

C 

CD 

14 

sec-Butyl  Alcohol 

D 

Training 

15 

iso-Butyl  Alcohol 

D 

D 

DC 

16 

tert-Butyl  Alcohol 

D 

D 

DC 

17 

n-Amyl  Alcohol 

D 

C 

CD 

18 

iso-Amyl  Alcohol 

D 

C 

CD 

19 

Hexanol 

D 

C 

CD 

20 

Methyl  amyl  Alcohol* 

D 

c 

CD 

21 

Methyl  Isobutyl  Alcohol* 

D 

c 

D-C 

22 

Ethyl  Butanol* 

D 

c 

D-C 

23 

Cyclohexanol

D 

c 

CB 

TABLE  VIII  cont'd 


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TEST  COMPOUNDS  Classificati0n 


Number 

Compound  Name 

NAS 

MULTI 

PLANE 

24 

n-Octyl  Alcohol* 

D 

C 

DC 

25 

iso-Octyl  Alcohol* 

D 

C 

DC 

26 

2-Ethyl  Hexanol* 

D 

C 

DC 

27 

Nonyl  Alcohol* 

D 

C 

DC 

28 

Diisobutyl  Carbinol*

D 

C 

DC 

29 

n-Decyl  Alcohol 

D 

C 

DC 

30 

iso-Decyl  Alcohol* 

D 

C 

Dc 

31 

Undecanol* 

D 

C 

- -Dc  ** 

32 

Dodecanol 

D 

C

Dc 

33 

Tridecanol* 

D 

c 

Dc 

34 

Tetradecanol* 

D 

c 

Dc 

35 

Pentadecanol* 

D 

c 

Dc 

36 

Allyl  Alcohol 

C 

Training 

37 

Diacetone  Alcohol

D 

D 

Dc 

38 

Formaldehyde  Solution 

- 

C(pure) 

CB(pure) 

39 

Acetaldehyde 

C 

Training 

40 

Propionaldehyde 

C 

C 

CD 

41 

n-Butyraldehyde 

c 

Training 

42 

iso-Butyraldehyde* 

c 

C 

CD 

43 

Valeraldehyde* 

c 

C 

CD-B 

44 

3-Methyl  Butyraldehyde* 

c 

C 

CD 

45 

iso-Pentyl  Aldehyde* 

c 

C-B 

CD 

46 

2-Ethyl hexaldehyde 

c 

B 

CD 

47 

iso-Octyl  Aldehyde* 

c 

B 

CD 

48 

n-Decaldehyde* 

c 

C 

CB 

49 

iso-Decaldehyde* 

c 

C 

CD 

50 

Acrolein

B 

Training 

51 

Crotonaldehyde 

C 

Training 
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TABLE  VIII 

cont’d  TEST  COMPOUNDS 

Classification 

Number 

Compound  Name 

NAS 

MULTI 

PLANE 


52 

2-Ethyl-3-Propyl  Acrolein*

C 

C 

CD 

53 

Glyoxal* 

C 

B 

B 


54 

Glutaraldehyde* 

C 

C 

CD 


55 

Furfural 

C 

C 

CD 


56 

Methane 

D 

Training 


57 

Ethane 

D 

D 

DC 


58 

Propane 

D 

Training 


59 

Butane 

D 

Training 


60 

n-Pentane 

D 

C 

CD 


61 

iso-Pentane 

D 

D 

DC 

62 

n-Hexane 

D 

C 

CD 

63 

iso-Hexane 

D 

C 

CD 

64 

n-Heptane 

D 

C 

CD 

65 

Octane 

D 

C 

CD 

66 

Nonane 

D 

C 

CD 

67 

Cyclopropane 

C 

Training 

68 

Cyclohexane 

D 

C 

C 

69 

Monoethanolamine*

D 

D 

DB

70 

Diethanolamine 

D 

C 

CD 


71 

Triethanolamine* 

D 

D 

DC 


72 

Monoisopropanolamine*

D 

D 

Dc 


73 

Diisopropanolamine*

D 

C 

DC 


74 

n-Aminoethyl  Ethanolamine

D 

C 

CD 

75 

Ethylamine

D 

Training 

76 

iso-Propylamine

D 

D 

DC



77 


Dimethylamine


C 


C 


.JZ 1 
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TABLE  VIII  cont'd.  TEST  COMPOUNDS 


Classifications 

Number 

Compound  Name 

NAS 

MULTI 

PLANE 

78 

Diethylamine

C 

Training 

79 

Di-n-propylamine*

C 

C 

CD 

80 

Diisopropylamine*

C 

C 

DC 

81 

Triethylamine

C 

Training 

82 

Ethylene  Diamine 

D 

Training 

83 

Hexamethylene  Diamine  Solutions* 

D 

D-B 

DC 

84 

Diethylenetriamine

D 

C 

CD

85 

Triethylene  Tetramine* 

D 

C 

Dc 

86 

Tetraethylene  Pentamine*

D 

C 

Dc 

87 

Ethylenimine 

C 

Training 

88 

Hexamethylenimine*

C 

C 

CD 

89 

Aniline 

D 

D

Dc 

90 

Pyridine 

D 

Training 

91 

2-Methyl-5-Ethyl  Pyridine*

D 

D 

Dc 

92 

Benzene 

D 

D 

Dc 

93 

Toluene 

D 

D 

Dc 

94 

Ethyl  Benzene 

D 

D 

Dc 

95 

Cumene 

D 

D 

Dc 

96 

Decyl  Benzene* 

D 

C 

Dc 

97 

Undecyl  Benzene* 

D 

C 

Dc 

98 

Dodecyl  Benzene* 

D 

C 

Dc 

99 

Tridecyl  Benzene*

D 

C 

Dc 

100 

Tetradecyl  Benzene* 

D 

C 

Dc 

101 

o-Xylene

D 

D 

Dc 

102 

m-Xylene 

D 

D 

Dc 

103 

p-Xylene 

D 

Training 

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TABLE  VIII  cont  d.  TEST  COMPOUNDS 

Classifications 


Number 

Compound  Name 

NAS 

MULTI 

PLANE 

104 

Xylene  (mixture) 

D 

- 

- 

105 

p-Cymene 

D 

D 

DC 

106 

Diethyl  benzene 

D 

D 

DC 

107 

Triethyl  Benzene*

D 

D 

Dc 

108 

Styrene 

D 

Training 

109 

Vinyl  Toluene*

D 

D 

Dc 

110

Naphthalene 

D 

D 

DB

111 

Tetrahydronaphthalene 

D 

D 

Dc 

112 

Mixture 

D 

- 

- 

113 

Methyl  Acetate 

D 

D 

Dc 

114 

Ethyl  Acetate 

D 

D 

Dc 

115 

n-Propyl  Acetate 

D 

D 

Dc 

116 

iso-Propyl  Acetate 

D 

D 

Dc 

117 

n-Butyl  Acetate 

D

D 

Dc 

118 

sec-Butyl  Acetate 

D 

D 

Dc 

119 

iso-Butyl  Acetate 

D 

D 

Dc 

120 

n-Amyl  Acetate 

D 

D 

Dc 

121 

iso-Amyl  Acetate 

D 

D 

Dc 

122 

Methyl  amyl  Acetate* 

D 

D-C 

Dc 

123 

Vinyl  Acetate 

D 

Training 

124 

Methyl  Acrylate* 

D 

D 

Dc 

125 

Ethyl  Acrylate 

D 

Training 

126 

n-Butyl  Acrylate* 

D 

C 

Dc 

127 

iso-Butyl  Acrylate* 

D 

D-C 

Dc 

128 

2-Ethylhexyl  Acrylate

D 

C 

Dc 
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TABLE  VIII 

cont'd.  TEST  COMPOUNDS

Classification

Number

Compound  Name

NAS

MULTI

PLANE

129 

iso-Decyl  Acrylate 

D 

D

DC 

130 

Methyl  Methacrylate* 

D 

D

DC 

131 

Propiolactone* 

C 

B 

CB 

132 

Caprolactone* 

D

C 

CB 

133 

o-Dibutyl  Phthalate

D 

D 

DC 

134 

o-Diheptyl  Phthalate* 

D

C 

DC 

135 

Dioctyl  Phthalate*

D 

C 

DC 

136 

Dinonyl  Phthalate* 

D 

C 

DC 

137 

Diisodecyl  Phthalate* 

D

C 

DC 

138 

Diundecyl  Phthalate* 

D 

C 

DC 

139 

Butyl  Benzyl  Phthalate*

D

D 

DC 

140 

Ethyl  Ether 

C 

Training 

141 

iso-Propyl  Ether 

D 

Training 

142 

Ethylene  Oxide 

B 

Training 

143 

Propylene  Oxide 

B 

Training 

144 

Tetrahydrofuran 

C 

Training 

145 

1,4  Dioxane

C 

C 

CD 

146 

Morpholine 

C 

Training 

147 

Epichlorohydrin 

C 

Training 

148 

Dichloroethyl  Ether 

- 

D 

Unclassified** 

149 

Methyl  Formal 

C 

Training 

150 

Propyl  Formal* 

c 

C 

CD 

151 

n-Butyl  Formal* 

c 

C 

CD 

152 

iso-  Butyl  Formal* 

c 

C 

CD 

153 

Furfuryl  Alcohol 

c 

D 

DC

154 

Ethylene  Glycol 

D 

C 

CD 

155 

Propylene  Glycol 

D 

D 

CD 

156 

1,3  Butylene  Glycol

D 

C 

DC 


TABLE  VIII 

cont'd.  TEST  COMPOUNDS 

Classifications 

Number 

Compound  Name 

NAS 

MULTI 

PLANE 

157 

Hexylene  Glycol 

D

C 

DC 

158 

Ethylene  Glycol  Monomethyl 

Ether 

C 

C 

CD 

159 

Ethylene  Glycol  Monoethyl 

Ether 

C 

C 

CD 

160 

Ethylene  Glycol  Monobutyl  Ether 

C 

C 

CD 

161 

Diethylene  Glycol 

C 

C 

CB 

162 

Diethylene  Glycol  Monomethyl
Ether* 

C 

C 

CD 

163 

Diethylene  Glycol  Monoethyl 
Ether* 

C 

C 

CD 

164 

Diethylene  Glycol  Monobutyl 
Ether* 

C 

C 

CD 

165 

Diethylene  Glycol  Monobutyl
Ether  Acetate 

C 

D 

DC 

166 

Dipropylene  Glycol* 

C 

C 

CD 

167 

Triethylene  Glycol 

C 

D 

CD 

168 

Tripropylene  Glycol*

C 

C 

Unclassified** 

169 

Methoxy  Triglycol* 

C 

C 

Unclassified** 

170 

Ethoxy  Triglycol*

C 

C 

Unclassified** 

171 

Tetraethylene  Glycol* 

c 

C 

Unclassified** 

172 

Ethylene  Glycol  Monoethyl 

Ether  Acetate 

c 

D 

DC 

173 

Ethylene  Glycol  Monobutyl 

Ether  Acetate 

c 

D 

Dc 

174 

Triethylene  Glycol  Di-(2-Ethyl
Butyrate)* 

c 

C 

Dc 

175 

Glycol  Diacetate* 

D 

D 

Dc 

176 

2-Hydroxyethyl  Acrylate* 

D 

D-C 

Dc 

177 

Glycerine 

D 

C 

CD 

178 

Methyl  Chloride 

D 

D 

DC 

179 

Methylene  Chloride 

D 

D 

DC 

180 

Methyl  Bromide 

D 

D 

DC 

181 

Ethyl  Chloride 

D 

D 

DC 

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TABLE  VIII 

cont'd.  TEST  COMPOUNDS

Classifications

Number

Compound  Name

NAS

MULTI

PLANE

182 

Ethylene  Chloride 

D 

Training 

183 

1,1,1-Trichloroethane

D 

D

DC

184 

1,2-Dichloropropane

D 

D 

Dc 

185 

Ethylene  Chlorohydrin

D 

D 

Unclassified** 

186 

Vinyl  Chloride 

D 

Training 

187 

Vinylidene  Chloride

D 

D 

DC 

188 

Trichloroethylene 

D 

D 

DC 

189 

Dichloropropane 

D 

D 

DC 

190 

Allyl  Chloride

D 

D 

Dc 

191 

Chlorobenzene 

D 

D 

Dc 

192 

o-Dichlorobenzene

D 

D 

Unclassified** 

193 

1,2,4-Trichlorobenzene

D 

D 

Unclassified** 

194 

Acetone 

D 

D 

DC 

195 

Methyl-Ethyl  Ketone

D 

D 

DC 

196 

Methyl  Isobutyl  Ketone 

D 

D 

Dc 

197 

Diisobutyl  Ketone*

D 

D 

Dc 

198 

Mesityl  Oxide 

D 

Training 

199 

Cyclohexanone 

D 

D 

Unclassified** 

200 

Isophorone 

D 

D 

DC 

201 

Acetonitrile 

D 

D 

DC 

202 

Acrylonitrile 

D 

Training 

203 

Ethylene  Cyanohydrin 

D 

D 

DC 

204 

Acetone  Cyanohydrin 

D 

D 

DC 

205 

Adiponitrile* 

D 

D 

Unclassified** 

206 

Ethylene 

C 

Training 

207 

Propylene 

C 

Training 

208 

Butylene 

D 

C 

CD 

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TABLE  VIII  contd.  TEST  COMPOUNDS 

Classifications 


Number 

Compound  Names 

NAS 

MULTI 

PLANE 

209 

Butadiene 

B 

Training 

210 

1-Pentene 

D 

C 

CD 

211 

Isoprene 

D 

Training 

212 

Hexene 

D 

B 

CD 

213 

Heptene 

D 

B 

CD 

214 

Octene 

D 

B 

CD 

215 

Diisobutylene

D 

Training 

216 

Nonene 

D 

C 

CD 

217 

Tripropylene* 

D 

C 

CD 

218 

Decene 

D 

c 

CD 

219 

Turpentine 

D 

D 

DC 

220 

Dipentene 

D 

C 

CD 

221 

Undecene* 

D 

C 

DC-CD

222 

Dodecene* 

D 

C 

DC 

223 

Tetrapropylene* 

D 

C 

Unclassified** 

224 

Tridecene*

D 

C 

CD 

225 

Tetradecene 

D 

C 

CD 

226 

Dicyclopentadiene*

C 

C 

CD 

227 

Acetylene 

A 

Training 

228 

Methyl  Acetylene-Propadiene 

B 

- 

- 

229 

Aluminum  Triethyl 

- 

- 

- 

230 

Ammonia  (anhydrous) 

D 

Training 

231 

Carbon  Disulfide 

A 

Training 

232 

Dimethylformamide

D 

D 

DC 

233 

unsym-Dimethyl  Hydrazine 

C 

Training 

234 

Monomethyl  Hydrazine*

C 

C 

BC 
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TABLE  VIII 

cont’d 

TEST  COMPOUNDS 

Classifications 

Numbers 

Compound  Name 

NAS 

MULTI 

PLANE 

235 

2-Nitropropane 

C 

Training 

236 

Nitrobenzene 

D

C 

CD 

237 

Dinitrotoluene 

C 

C 

Unclassified 

238 

Hydrogen 

B 

Training 

239 

Hydrogen  Sulfide 

C 

Training 

240 

Phenol 

D 

D 

Compounds and NAS classifications are from "Matrix of Electrical and Fire Hazard
Properties and Classifications of Chemicals", National Academy of Sciences, Washington,
D. C. (DOT-CG-4168J-A), 1975.

aMULTI  is  a multicategory  separator  contained  in  the  statistical  package 
routine  called  ARTHUR.  (Appendix  II) 

aPLANE  is  a two  category  separator  contained  in  the  statistical  package 
routine  called  ARTHUR.  The  subscript  denotes  the  choice  between  the  two 
categories  the  compound  was  not  classified  as.  (Appendix  II) 

*These compounds had auto-ignition temperatures and/or solubilities
missing. A range of possible values was estimated by examining similar
compounds and trends within the groups. A maximum interval of 50° was used
for the auto-ignition temperature and of one unit for the solubilities. In
cases in which a decision could not be made, both chosen categories are listed.

**PLANE, which examines only two groups at a time, did not give a unique
classification for all three pairs.
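
The footnotes describe PLANE as a two-category separator whose three pairwise decisions (B vs. C, B vs. D, C vs. D) are combined into a main category plus a subscript, with "Unclassified" reported when the three pairs do not agree on a unique answer. The sketch below shows one way such pairwise decisions could be combined. The voting rule and the function name `combine_pairwise` are assumptions made for illustration; the report does not give ARTHUR's actual procedure.

```python
from itertools import combinations


def combine_pairwise(decisions, categories=("B", "C", "D")):
    """Combine three pairwise decisions into (category, subscript) or None.

    `decisions` maps frozenset({x, y}) to the category chosen for that pair.
    Assumed rule: the category that wins both of its pairs is the answer, and
    the subscript is the choice made between the two remaining categories.
    """
    wins = {c: 0 for c in categories}
    for pair in combinations(categories, 2):
        wins[decisions[frozenset(pair)]] += 1
    best = max(wins.values())
    winners = [c for c in categories if wins[c] == best]
    if len(winners) != 1:
        return None  # each category won once: no unique answer ("Unclassified")
    first = winners[0]
    remaining = frozenset(c for c in categories if c != first)
    return first, decisions[remaining]


if __name__ == "__main__":
    votes = {
        frozenset("BC"): "C",
        frozenset("BD"): "D",
        frozenset("CD"): "D",
    }
    print(combine_pairwise(votes))  # D with subscript C
```

Under this reading, an entry such as "DC" in the PLANE column means D was preferred over both B and C, and C was preferred over B.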




TABLE IX

Compounds Classified Upwards

Methyl Alcohol
Ethyl Alcohol
iso-Propyl Alcohol
n-Butyl Alcohol
n-Amyl Alcohol
iso-Amyl Alcohol
Hexanol
Methyl amyl Alcohol
Methyl Isobutyl Alcohol
Ethyl Butanol
Cyclohexanol
n-Octyl Alcohol
iso-Octyl Alcohol
2-Ethyl Hexanol
Nonyl Alcohol
Diisobutyl Carbinol
n-Decyl Alcohol
iso-Decyl Alcohol
Undecanol
Dodecanol
Tridecanol
Pentadecanol
iso-Pentyl Aldehyde
2-Ethyl Hexaldehyde
iso-Octyl Aldehyde
Glyoxal
n-Pentane
n-Hexane
n-Heptane
Octanes
Cyclohexane
Diethanolamine
Diisopropanolamine
n-Aminoethyl Ethanolamine
Diethylenetriamine
Triethylene Tetramine
Tetraethylene Pentamine
Decyl Benzene
Undecyl Benzene
Dodecyl Benzene
Tridecyl Benzene
Tetradecyl Benzene
Methyl amyl Acetate
n-Butyl Acrylate
iso-Butyl Acrylate
2-Ethylhexyl Acrylate
Propiolactone
Caprolactone
o-Diheptyl Phthalate
Dioctyl Phthalate
Dinonyl Phthalate
Diisodecyl Phthalate
Diundecyl Phthalate
Ethylene Glycol
Propylene Glycol
1,3 Butylene Glycol
Hexylene Glycol
Triethylene Glycol
2-Hydroxyethyl Acrylate
Glycerine
Butylene
1-Pentene
Hexene
Heptene
Octene
Nonene
Tripropylene
Decene
Dipentene
Undecene
Dodecene
Tetrapropylene
Tridecene
Tetradecene
Monomethyl Hydrazine
Nitrobenzene
Acrylic Acid
Nonane
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APPENDIX  I 


Simple Experiments for Understanding
Factor Analysis and Hierarchical Clustering


INTRODUCTION 

Modern analytical tools, such as neutron activation analysis and
atomic absorption spectroscopy, have enabled scientists to collect large
amounts of quantitatively accurate information from individual samples.
When many samples are involved, the scientist is then faced with the
dilemma of interpreting his data. The conventional first step is to
place all of the data in table form. Examining multivariable data tables
in this way can cause eye strain but, except where data values are
unusually different, often leads to little else. Simple statistics
such as standard deviations and t-tests may tell the scientist which
values are outliers, but once again will often show him little of the complex
interrelationships among the variables or samples. The researcher may
then plot certain variables of his data against other variables in two
dimensions. This step can be a great aid to interpretation, since he can
now see a spatial representation of relationships among the selected
variables. At the same time, however, it is quite limited in the amount of
information that can be displayed.

Another step which has recently been applied to chemical problems
is computerized pattern recognition, in which all of the variables (or
samples) may be compared to one another to determine their inter- and
intra-relationships. Pattern recognition is a developing branch of artificial
intelligence (1) which has been used for such diverse purposes as medical
diagnosis (2,3), the identification of rocks (4), and hand-drawn character
identification (5). Jurs (6), Kowalski (7), and Isenhour (8) have described
how pattern recognition can be useful in solving a variety of chemical
problems. Chemical applications have included the identification and
interpretation of mass spectra data (9), IR spectra (10), NMR data (11),
gamma-ray spectroscopy from neutron activation (12), and stationary
electrode polarography (13). Other studies have included the
determination of the correct chlorine dosages for water treatment (14); the
relationship between mass spectra data and the pharmacological activity
of drugs (15); analysis for oil in natural waters (16); screening of
prospective anti-cancer drugs (17); and the classification of archeological
artifacts from trace element data (18).

Factor analysis is a form of pattern recognition in which linear
combinations of a set of experimental variables are developed, with the
aim of reducing the number of variables. The method has been described
in detail by Veldman (19) and Harman (20). This technique has been applied
to such diverse areas as biology, to determine the growth patterns in plants
(21); psychology, to study word recognition (22) and cultural differences
(23); meteorology, to study coastal air and water temperatures (24); and
geology, to define deformational modes in rock (25). Chemists have used
factor analysis to study data from nuclear magnetic resonance spectroscopy
(26) and from gas-liquid chromatography (27). Factor analysis has also been
used to correlate trace element and other chemical data collected from a
number of samples. Examples include the study of chemical pollutants in
air samples (28, 29) and the correlation of rocks based on their chemical
composition (30, 31).

Pre-treatment of the raw data may include normalizing the variable
(or sample) values to the mean and standard deviation. The data may then be
reduced to a correlation coefficient matrix. A number of correlation
coefficients may be used, including the cosine coefficient, the distance
coefficient, and the Horner coefficient. The product moment correlation
coefficient is used in this article's examples:

$$ r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left[\sum_i (x_i - \bar{x})^2\right]^{1/2} \left[\sum_i (y_i - \bar{y})^2\right]^{1/2}} $$
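
As a concrete illustration of the product moment coefficient, the following minimal sketch computes it for two variables and builds the full correlation matrix for a list of variables. It is written in Python purely for illustration (the authors' program was in FORTRAN IV), and the function names are hypothetical.

```python
import math


def product_moment(x, y):
    """Product moment (Pearson) correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(variables):
    """Correlation coefficient matrix; `variables` is a list of columns (one list per variable)."""
    return [[product_moment(vi, vj) for vj in variables] for vi in variables]


if __name__ == "__main__":
    length = [2.0, 5.0, 7.0, 9.0]
    doubled = [4.0, 10.0, 14.0, 18.0]    # perfectly correlated with length
    reversed_ = [9.0, 7.0, 5.0, 2.0]
    for row in correlation_matrix([length, doubled, reversed_]):
        print([round(r, 3) for r in row])
```

The matrix is symmetric with ones on the diagonal, and it is this matrix that the factor analysis step takes as its input.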

The factor analysis method used by the authors takes this correlation
coefficient matrix and determines the set of eigenvalues for the linear
combinations, the cumulative percentages of these eigenvalues, the
eigenvectors, and finally the loaded factor matrix for each of the
eigenvalues. The general model used, that of principal components, was:

$$ z_j = a_{j1} F_1 + a_{j2} F_2 + \cdots + a_{jn} F_n \qquad (j = 1, 2, \ldots, n) $$

where each of the n observed variables of the new data matrix was
described linearly in terms of the new uncorrelated components, F (20).
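
The eigenvalue step can be sketched as follows. This is only an illustration of the principal components calculation: it uses the classical Jacobi rotation method for a symmetric matrix, which is an assumption (the report does not say which eigenvalue routine the authors' program used), and all function names are hypothetical. Loadings are formed as each eigenvector scaled by the square root of its eigenvalue.

```python
import math


def jacobi_eigen(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues (descending) and eigenvectors (as columns) of a symmetric matrix."""
    n = len(matrix)
    a = [list(row) for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle (at most 45 degrees) that zeroes a[p][q].
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q of a and of v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # update rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    order = sorted(range(n), key=lambda k: a[k][k], reverse=True)
    values = [a[k][k] for k in order]
    vectors = [[v[i][k] for k in order] for i in range(n)]
    return values, vectors


def factor_loadings(values, vectors):
    """Loading a_jk = (component k of eigenvector for variable j) * sqrt(eigenvalue k)."""
    return [[row[k] * math.sqrt(max(values[k], 0.0)) for k in range(len(values))]
            for row in vectors]


def cumulative_percent(values):
    """Cumulative percentage of total variation accounted for by successive eigenvalues."""
    total, running, out = sum(values), 0.0, []
    for lam in values:
        running += lam
        out.append(100.0 * running / total)
    return out


if __name__ == "__main__":
    r = [[1.0, 0.8], [0.8, 1.0]]
    vals, vecs = jacobi_eigen(r)
    print(vals, cumulative_percent(vals), factor_loadings(vals, vecs))
```

When all n components are kept, the loading matrix multiplied by its own transpose reproduces the correlation matrix, which is a convenient check on the calculation.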

The "a" coefficients are the factor loadings. Those eigenvalues
considered to be significant factors are retained, significance usually
being defined as a value greater than or equal to 1.00. The loaded
factor matrix of significant factors then undergoes varimax rotation
in order to maximize the differences among the factors. This rotated
factor matrix is normalized to range from -1.0 to +1.0. A positive
value of a variable in a factor shows a direct relationship of the
variable to that factor; the greater the value, the stronger the
relationship indicated. A negative number shows some inverse or negative
relationship, and a value close to zero implies that there is no direct
relationship between variable and factor. This rotated factor matrix
may then be studied in either table or graphic form in order to interpret
the initial data.
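
The retention rule and the rotation can be sketched as below. The rotation shown is the raw (unnormalized) pairwise varimax procedure in Kaiser's original form; whether the authors' program used this variant or a row-normalized one is not stated, so it should be read as an assumption. Function names are hypothetical.

```python
import math


def retain_significant(eigenvalues, loadings, cutoff=1.0):
    """Keep only the loading columns whose eigenvalue is >= cutoff (1.00 in the text)."""
    keep = [k for k, lam in enumerate(eigenvalues) if lam >= cutoff]
    return [[row[k] for k in keep] for row in loadings]


def varimax(loadings, sweeps=30, eps=1e-10):
    """Raw varimax: rotate each pair of factors by the angle that maximizes the
    variance of the squared loadings, repeating until the angles become negligible."""
    rot = [list(row) for row in loadings]
    n, m = len(rot), len(rot[0])
    for _ in range(sweeps):
        largest = 0.0
        for p in range(m - 1):
            for q in range(p + 1, m):
                u = [r[p] ** 2 - r[q] ** 2 for r in rot]
                v = [2.0 * r[p] * r[q] for r in rot]
                a, b = sum(u), sum(v)
                c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
                d = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
                phi = 0.25 * math.atan2(d - 2.0 * a * b / n, c - (a * a - b * b) / n)
                largest = max(largest, abs(phi))
                cs, sn = math.cos(phi), math.sin(phi)
                for r in rot:
                    x, y = r[p], r[q]
                    r[p], r[q] = x * cs + y * sn, -x * sn + y * cs
        if largest < eps:
            break
    return rot


if __name__ == "__main__":
    alpha = math.radians(10.0)
    # A perfectly "simple" two-factor structure that has been turned by 10 degrees.
    turned = [[math.cos(alpha), math.sin(alpha)], [-math.sin(alpha), math.cos(alpha)]]
    print(varimax(turned))
```

Because the rotation is orthogonal, the sum of squared loadings for each variable (its communality) is unchanged; only the way the variation is shared among the factors changes.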

Another useful pattern recognition method is hierarchical clustering
(19). This unsupervised learning method clusters the samples from either
the raw data or a normalized data matrix according to their n-dimensional
distance vector across the variables. The mean distance between
samples or clusters is used to determine the relative error of the
grouping and is used as the new vector distance value for future groupings.
Those groups closest in distance values will cluster first. Eventually,
all groups will be clustered into two groups. A dendrogram can then be
made of the series of clusters to give a graphical representation of the
calculations. The original data matrix may be transposed and a similar
clustering may be made of the variables as they vary across the samples.
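
A minimal sketch of this kind of agglomerative clustering is given below. It assumes Euclidean distance and takes "mean distance between clusters" to be the average of all pairwise sample distances (average linkage); the report does not spell out the exact definition, so both choices are assumptions, as is the function name.

```python
import math


def average_linkage(samples, stop_at=2):
    """Merge the two closest clusters repeatedly until `stop_at` clusters remain.

    Returns the final clusters (lists of sample indices) and the merge history
    as (cluster_a, cluster_b, mean_distance) tuples, which is the information
    needed to draw a dendrogram.
    """
    clusters = [[i] for i in range(len(samples))]
    merges = []

    def mean_distance(a, b):
        total = sum(math.dist(samples[i], samples[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > stop_at:
        best = None
        for x in range(len(clusters) - 1):
            for y in range(x + 1, len(clusters)):
                d = mean_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters, merges


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (10.0, 2.0)]
    final, history = average_linkage(points)
    print(final)
    for step in history:
        print(step)
```

To cluster variables instead of samples, the same function can be applied to the transposed data matrix, as the text describes.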


EXPERIMENTAL 

The authors have developed a FORTRAN IV computer program to handle
statistical evaluation of data, perform correlation analysis, factor
analysis and hierarchical clustering, and display data and results
in either table or graphical form (Table I). All calculations were
done at the University of Rhode Island's Computer Science Center on an
IBM-370/155 computer, and graphics were done on a Broomall Industries
2000 Series Incremental Plotter. Data input to the programs is accepted
either from cards or from general disk storage data banks. Four
correlation coefficients are presently available: the product moment
correlation coefficient, cosine coefficient, distance coefficient, and
Horner coefficient. The graphic displays can handle any data matrix from
the routines, from the raw data to any calculated coefficients. Almost all
routines may be accessed at any time during program operation, specific use
being governed by program read control cards. Once the raw data has been
entered, it may be treated by any of the procedures and the output may
be returned to the user in either table or graphic form. The programming
package has been designed so that the user need not have programming
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experience  to  operate  it. 

To illustrate factor analysis as an interpretive tool, a synthetic
data set consisting of fifty groups of length, width and height values
(in other words, fifty random boxes) was generated from a random number
table (32).
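
A data set of this kind is easy to regenerate. The sketch below draws fifty random boxes with a seeded pseudo-random generator in place of the printed random number table, and derives a few linearly weighted variables from them. The value range and the weights are hypothetical, since Table II is not reproduced in this section.

```python
import random

# Hypothetical weights (length, width, height); the report's Table II defines the real ones.
HYPOTHETICAL_WEIGHTS = {
    "length": (1, 0, 0),
    "width": (0, 1, 0),
    "height": (0, 0, 1),
    "length-weighted": (3, 1, 1),
    "width-weighted": (1, 3, 1),
    "height-weighted": (1, 1, 3),
}


def random_boxes(count=50, low=1, high=99, seed=1):
    """Return `count` (length, width, height) tuples of random integers in [low, high]."""
    rng = random.Random(seed)
    return [(rng.randint(low, high), rng.randint(low, high), rng.randint(low, high))
            for _ in range(count)]


def linear_variables(box, weights=HYPOTHETICAL_WEIGHTS):
    """Linear combinations of the three box dimensions, one value per named variable."""
    return {name: sum(w * side for w, side in zip(ws, box)) for name, ws in weights.items()}


if __name__ == "__main__":
    boxes = random_boxes()
    print(boxes[:3])
    print(linear_variables(boxes[0]))
```

Fixing the seed makes the synthetic data reproducible, which plays the same role as citing a specific random number table.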

A group of linearly related variables (Table II) was generated
from the length, width and height values. Using factor analysis, these
ten variables were reduced to three significant factors, each containing
about one third of the total variation among the variables. The
rotated factor matrix is shown in Table III. It is useful to graph
the variable values of the rotated factor matrix as they vary across
the factors. The two-dimensional plot of factor one versus factor two
(Figure 1a) indicated that the length variable was strongly associated
with factor one while unrelated to factors two and three. The width
variable was strongly associated with factor two and unrelated to
factors one and three, and the height variable was unrelated to either factor
one or two. The linear combination variables were arranged according to
their weighted length, width or height value. The plot of factor one
versus factor three (Figure 1b) was nearly identical to the previous
figure, except that width and height had been reversed. Plotting
factor two versus factor three also showed a similar result (Figure 1c),
this time reversing length and width. Since each factor contained about
33 percent of the total variation, each two-dimensional plot could only give
about 2/3 of the information available. Comparison of these three factors
on one three-dimensional plot (Figure 2) simplified the interpretation of
the problem by allowing 100 percent of the information to be presented
at one time. The x-axis (right side) and y-axis (left side) represented
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factors  one  and  two,  respectively,  and  factor  three  was  the  z-axis. 

The peaks in the rear three corners represented a value of 0.0, or in
other words, a non-relationship of the variable to factor three. It was
interesting to note in Figure 2 the successive progression along diagonals
of the associated length-weighted, width-weighted, and height-weighted
variables. The valid interpretation from this information was that
factors one, two and three were actually length, width and height,
respectively. Such an interpretation would be nearly impossible to make from
observation of the raw data alone. Table IV is a partial listing of the
initial input information for this example.

Factor analysis was designed to associate linearly related variables,
but it may also be used to correlate variables with non-linear
relationships. To demonstrate this point, a variable data matrix of cross-product
information from the boxes (Table II) was tested in a similar manner. When
these variables were handled exactly the same as in the preceding example,
three nearly equal factors were again obtained from factor analysis,
although in this case they contained about 92 percent of the variation
instead of the 100 percent found in the linear example. When their
relative positions on the three-dimensional plot of these three significant
factors were observed (Figure 3), the variables length, width and height
were very strong in factors one, two and three, respectively. The
interpretation once again was that length, width and height were the three
significant factors, as would be expected.
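
For illustration, cross-product variables of the kind discussed here (face areas, side diagonals, total volume, total surface area) can be computed from a box as in the sketch below. The exact set of variables in the report's Table II is not reproduced in this section, so the selection and names here are assumptions.

```python
import math


def cross_product_variables(length, width, height):
    """Non-linear (cross-product) variables derived from one box."""
    lw, lh, wh = length * width, length * height, width * height
    return {
        "length-width area": lw,
        "length-height area": lh,
        "width-height area": wh,
        "length-width diagonal": math.hypot(length, width),
        "length-height diagonal": math.hypot(length, height),
        "width-height diagonal": math.hypot(width, height),
        "total volume": length * width * height,
        "total surface area": 2 * (lw + lh + wh),
    }


if __name__ == "__main__":
    for name, value in cross_product_variables(3, 4, 12).items():
        print(f"{name}: {value}")
```

Each of these quantities depends on two or three dimensions at once, which is why they correlate with the underlying length, width and height factors less cleanly than the linear combinations do.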

As an added test of factor analysis, the linear and cross-product
data sets were then combined and tested the same way. Again three
significant factors were found, this time accounting for 96 percent of
the variation. There were no significant differences in the rotated
factor matrix values from the previous two determinations, and a study
of the three-dimensional representation of the three factors (Figure 4)
showed nearly the same result that would be found if the
three-dimensional plots of the two data sets alone were superimposed.

Two additional tests were necessary to confirm the validity of
factor analysis in circumstances where the answer is not known
beforehand. Subgroups of 40, 30, 20 and 10 boxes from the linear variable
data set were studied (Table V) to determine the effect of sample size
on the results. In the second test, the values for the variables length,
width and height were deleted from the data matrix before factor analysis
in order to determine if the use of these three variables was biasing
the results. No significant differences were found in either experiment
from the results of the initial studies. Caution should be taken in
applying these results when interpreting real as opposed to synthetic
data. The size of the sample set is important: too few samples can cause
an incorrect clustering and hence false interpretations of the data.
At least twice as many samples as variables are necessary.
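
The rule of thumb stated above is simple to apply before running an analysis. The helper below is only an illustrative sketch; the function name and the `ratio` parameter are hypothetical.

```python
def enough_samples(n_samples, n_variables, ratio=2):
    """True if there are at least `ratio` times as many samples as variables."""
    return n_samples >= ratio * n_variables


if __name__ == "__main__":
    # Subgroup sizes from the sample-size test, with the ten linear variables.
    for size in (50, 40, 30, 20, 10):
        print(size, enough_samples(size, 10))
```

With ten variables, the 10-box subgroup would fall below this minimum even though it gave no significantly different results for the synthetic boxes, which is consistent with the caution about carrying these results over to real data.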

It is possible with this program package to rotate the
three-dimensional representation about the z-axis or in the X-Y plane. The best
view is data dependent because cluster representations can mask each other.
The three-dimensional representation of the linear box variable factor
matrix was used in Figure 5 to demonstrate this rotation. Figures 5a and
5b show the plot rotated to relative positions of 20° and 70° about the
z-axis while maintaining the X-Y plane at 45°. In Figures 5c and 5d, the
z-axis position has been returned to 45° and the X-Y plane rotated to 20°
and 70°, respectively.

Hierarchical clustering was applied to the same boxes and their
associated variables that were examined earlier using factor analysis.
The dendrogram of the linear variables of the boxes (Figure 6) showed a
clear separation of three cluster sets: length with length-weighted
variables, width with width-weighted variables, and height with
height-weighted variables. As in the factor analysis test, this was the
relationship that would be expected to occur among variables which are
known to have a linear relationship to one another. The same clustering
method was then used on the cross-product variables (Figure 7). The
interpretation of this plot was less clear-cut than that of the first
example.

Length clustered with length side diagonal and area information, and
width clustered with width side diagonal and area information. Total
volume and total surface area also clustered with one another. Since
hierarchical clustering is an unsupervised learning method, however, the
multiple interrelationships among a set of cross-product related
variables tend to produce clusters that are difficult to interpret. When
both sample sets were combined and tested by hierarchical clustering, the
resulting dendrogram (Figure 8) showed properties similar to each of the
previous two figures; that is, the linear variables were clustered into
three readily apparent groups of length-weighted, width-weighted, and
height-weighted values, and the two variables of total diagonal and
length-plus-width-plus-height also clustered closely.

One final experiment was performed on these boxes. The data matrix
was transposed and the fifty boxes themselves were compared to one
another as they varied across the linear relationship variables. Factor
analysis gave three significant factors, each with about one-third of
the total information. Hierarchical clustering also showed three
distinctly separate clusters (Figure 9), which can be accounted for by
the general groupings of boxes with a large width value and usually a
large height value bias, boxes with a small width bias with neither a
length nor a height value bias, and boxes with a small length value and a
small height value bias with no bias of the width value.
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TABLE  1 


PROGRAM ROUTINES


Input          Statistics

Card           Arithmetic Mean
Disk           Geometric Mean
               Median
               Standard Deviation
               Mean Std. Deviation
               Second Moment
               Third Moment
               Skewness
               Kurtosis
               Missing Values


Correlation Coefficients

Cosine Coefficient
Distance Coefficient
Horner Coefficient
Product Moment Coefficient


Clustering                  Graphics

Factor Analysis             2-Dimensional Line Printer
Hierarchical Clustering     2-Dimensional Computer Graphics
                            3-Dimensional Computer Graphics
                            Dendrogram Computer Graphics
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TABLE  III 

Rotated Factor Matrix
for Linear Box Variables

                           Factor
Variable           1          2          3

Length           0.9986       -          -
Width              -        0.9992       -
Height             -          -        0.9986
L+W+H            0.5835     0.5640     0.5842
2L+W+H           0.8172       -          -
L+2W+H             -        0.8146       -
L+W+2H             -          -        0.8065
3L+W+H           0.9031       -          -
L+3W+H             -        0.9071       -
L+W+3H             -          -        0.8915

% of Total
Information      33.7 %     32.8 %     33.5 %


TABLE V

Rotated Factor Matrices of Different Sized Data Sets

Factor 1
                            Number of Boxes
Variable           50         40         30         20         10

Length           0.9986     0.9981     0.9993     0.9749     0.9947
Width              -          -          -          -          -
Height             -          -          -          -          -
L+W+H            0.5835     0.5780     0.5857     0.5158     0.6118
2L+W+H           0.8172     0.8287     0.8365     0.8564     0.8760
L+2W+H             -          -          -          -          -
L+W+2H             -          -          -          -          -
3L+W+H           0.9031     0.9157     0.9218     0.9559     0.9476
L+3W+H             -          -          -          -          -
L+W+3H             -          -          -          -          -

Factor 2
                            Number of Boxes
Variable           50         40         30         20         10

Length             -          -          -          -          -
Width            0.9992     0.9969     0.9998     0.9898     0.9590
Height             -          -          -          -          -
L+W+H            0.5640     0.5532     0.5610     0.5713     0.5277
2L+W+H             -          -          -          -          -
L+2W+H           0.8146     0.8237     0.8120     0.8593     0.8966
3L+W+H             -          -          -          -          -
L+3W+H           0.9070     0.9190     0.9047     0.9432     0.9790
L+W+3H             -          -          -          -          -

Factor 3
                            Number of Boxes
Variable           50         40         30         20         10

Length             -          -          -          -          -
Width              -          -          -          -          -
Height           0.9986     0.9996     0.9995     0.9979     0.9815
L+W+H            0.5843     0.5989     0.5848     0.6384     0.5893
2L+W+H             -          -          -          -          -
L+2W+H             -          -          -          -          -
L+W+2H           0.8065     0.8300     0.8321     0.8813     0.8852
3L+W+H             -          -          -          -          -
L+3W+H             -          -          -          -          -
L+W+3H           0.8915     0.9111     0.9177     0.9478     0.9576
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APPENDIX  II 


A Description  of  ARTHUR 


This  appendix  is  abstracted  from  a paper  entitled 


"ARTHUR  and  Experimental  Data  Analysis: 


The  Heuristic  Use  of  a Polyalgorithm" 


A.  M.  Harper,  D.  L.  Duewer*  and  B.  R.  Kowalski 
Laboratory for Chemometrics
Department  of  Chemistry 
University  of  Washington 
Seattle,  Washington  98195 


and 


James  L.  Fasching 
Department  of  Chemistry 
University  of  Rhode  Island 
Kingston,  Rhode  Island  02881 
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"ARTHUR  and  Experimental  Data  Analysis: 

The  Heuristic  Use  of  a Polyalgorithm" 

Most non-routine data analysis in chemistry is designed to aid the
formulation and/or evaluation of some model or hypothesis of the intrinsic
data structure. The more detailed the model of the data's structure, that
is, the more complete the analyst's understanding of the data, the more
facile the selection of appropriate algorithms for the data analysis.
Conversely, where very little is known of the data's structure it is
difficult to make an a priori selection or evaluation of analysis
methodologies.

ARTHUR (1, 2), a system of data manipulation, pattern recognition and
robust statistical algorithms, is designed as a tool for the analyst in
applications where the data's structure is not well understood. The
algorithms included in the system are those which our laboratory and
other members of the Chemometrics Society have found useful in the
analysis of a number of quite different chemical and biological data
sets. Recently implemented provisions for the inclusion of measurement
uncertainties in the mathematical methods (2) enable the determination of
which aspects of the data structure are truly inherent to the data.
Descriptions of these algorithms can be found in the appendix of this
chapter. It should be noted that ARTHUR is meant to be complementary to
and not in competition with such primary statistical systems as SPSS (4)
and BMD (2).


The  primary  utility  of  ARTHUR  being  in  the  formulation  and 
evaluation  of  models  for  incompletely-understood  data  sets,  it  is  not 
possible  to  specify  given  algorithms  or  sequences  of  algorithms  which 
are  "best".  However,  in  the  course  of  much  data  analysis  (both  fruitful 
and  frustrating)  some  "rules  of  thumb"  or  heuristic  procedures  have  been 
formulated.  Following  an  introduction  to  the  "ARTHURian"  terminology 
of  data  analysis  and  pattern  recognition,  and  a description  of  the 
inclusion  of  measurement  uncertainties  in  pattern  recognition  methods, 
the  heuristic  techniques  the  developers  and  users  of  ARTHUR  have  found 
most  generally  useful  will  be  described. 


Definitions 


The following terms and definitions have proved useful in describing
the types of data analysis algorithms available in ARTHUR and in
describing the data to be analyzed.

Classification  Analysis 

The  data  are  known  to  be  composed  of  specified  groupings  or  categories. 
The  goals  of  such  analysis  are  the  identification  of  what  parameters  (if 
any)  qualitatively  distinguish  the  known  groupings  and  (if  possible)  the 
selection of a classification rule for identifying the known groups.

Continuous Property Analysis

The data are known to represent a continuous range of responses
towards some given property(ies). The goals of such analysis are the
identification  of  what  parameters  (if  any)  are  functionally  related  to 
the  property  and  (if  possible)  the  selection  of  a rule  which  quantitatively 
predicts  that  property. 

Unsupervised  Analysis 

The  data  are  not  known  to  have  any  systematic  characteristics.  The 
goal  of  such  analysis  is  the  discovery  of  what  systematic  behavior  the 
data  exhibit  (if  any  exists).  Study  of  the  regularities  among  objects 
is  generally  referred  to  as  cluster  analysis;  study  of  the  regularities 
among  measurements  is  generally  referred  to  as  factor  analysis. 

Object 

A compound,  sample,  individual  or  other  entity  for  which  a list  of 
characterizing  parameters  is  present  in  the  data  base. 


Measurement 

An  experimentally  available  parameter  (independent  variable)  used  to 
characterize  the  objects. 

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Feature 

Any transformation of one or more measurements used to characterize
the objects. When referring to a parameter which can be either a
measurement or a feature, the term "measurement/feature" is used.

Data  vector 

The  complete  list  of  measurement/feature  values  used  to  characterize 
a particular  object.  (The  older  chemical  pattern  recognition  literature, 
including  that  of  the  Laboratory  for  Chemometrics,  refers  to  this  as  a 
"pattern".  Considerable  semantic  confusion  over  "patterns  of  patterns" 
forced  the  change  to  the  term,  "data  vector".) 

Category 

One  of  the  groups  of  objects  studied  in  the  classification  analysis 
algorithms.  Categories  which  are  entirely  independent  of  one  another, 
such  as  the  labeling  of  white  bond  papers  by  their  manufacturer,  are 
referred to as discrete categories. Categories which have some dependence
upon one another, such as "low, middle and high", are referred to as
continuous categories.

Property 

A quantitative  parameter  characteristic  of  the  objects  for  which  a 
functional  representation  is  desired  (dependent  variable). 

Training  Set 

The  list  of  data  vectors  used  to  generate  classification  or  prediction 
rules. 

Evaluation  Set 

The list of data vectors used to evaluate the performance of
classification or prediction rules.
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Test  Set 

The list of data vectors, in classification or prediction analysis,
for which the true category or property value is not known. The
Evaluation and Test sets are functionally one and the same. The
Evaluation set is a "let's pretend" Test set.

Uncertainty 

The  error  associated  with  an  analytical  measurement.  The  uncertainty 
is  assumed  to  include  all  sources  of  errors  such  as  sampling,  instrumental, 
chemical,  etc. 

It  should  be  recognized  that  these  definitions  are  not  particularly 
rigid  or  mutually  exclusive.  A continuous  property  can  certainly  be 
segmented  into  the  low  resolution  categories  "too  low"  and  "high  enough". 
The  parameter  considered  as  a property  in  one  phase  of  analysis  may  well 
be  a measurement  in  another.  The  Training  and  Evaluation  set  definitions 
may  be  switched.  It  may  even  be  desired  to  switch  the  definition  of  object 
and  feature.  If  the  data  are  considered  as  a matrix  (objects  as  rows  and 
features  as  columns),  the  switch  is  equivalent  to  the  transposition  of  the 
matrix.  And  it  is  certainly  good  practice,  no  matter  what  the  specific 
nature  of  the  data  analysis  problem,  to  make  at  least  cursory  unsupervised 
data  analysis,  if  nothing  more  than  to  give  a rough  screen  for  some  gross, 
unsuspected  structure  in  the  data. 
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Pattern  Recognition:  New  Techniques  that  Utilize  Analytical  Error 

The general problem that is amenable to solution by the techniques
available in ARTHUR is the analysis of patterns in an n-dimensional space.
In the past, applications utilizing pattern recognition have not taken
into account the errors in the measurements because the mathematical
methods currently available make no provision for their inclusion.
However, for most chemical data the implicit assignment of zero
measurement error that results is clearly an unrealistic assumption. This
problem has been investigated by Fasching, Duewer and Kowalski (3). As a
result of this study, several algorithms in ARTHUR have been modified to
include the uncertainties in the calculations.

Current  pattern  recognition  techniques  treat  measurements  as  dimensions 
in  an  n-dimensional  space.  If,  for  each  member  of  a collection  of  objects 
(samples),  n measurements  are  known,  the  samples  are  represented  as  points 
in  the  space  formed  by  the  measurements.  Therefore,  the  value  of  a given 
measurement  for  a particular  object  serves  to  exactly  position  the  point 
representing  the  object  along  a coordinate  measurement  axis  in  the  n-space. 
Figure  1 depicts  the  configuration  of  the  data  vectors  from  two  samples  in 
a three-dimensional  space.  The  set  of  all  such  vectors  defines  the  data 
matrix. 

In analytical applications, where the uncertainties in the measurements
are either known or can be estimated, there exists a matrix of
uncertainties corresponding to and of the same dimensions as the data
matrix. Mathematical operations that transform the data matrix also
change the uncertainty matrix to a transformed uncertainty matrix. Each
element of the uncertainty matrix reflects the exactness (in units of ±
one standard deviation) to which the corresponding element of the data
matrix is known. Therefore, each measurement in the data matrix is now
treated as a mean value with a probability distribution defined by its
error. The effect of the inclusion of uncertainties on the vectors in
Figure 1 is illustrated in Figure 2. The analytical uncertainties reflect
the fact that a data vector is not, in reality, a point in the
measurement space, but is the most probable value in a region of
probability in this space. If the ellipsoid in this example is defined at
a 50% probability level of the standard deviation of each measurement,
then another set of measurements made on a sample would be as likely to
lie outside the ellipsoid as within it. This model is more reasonable for
most chemical problems.

At present, ARTHUR has been modified to include the analytical error
in representative methods of preprocessing, display, supervised learning
and unsupervised learning. A full description of these modifications can
be found in reference 3. The current methods deal only with symmetric
uncertainties. A nonmetric (unsymmetrical) distance is defined; however,
classification and clustering routines utilizing distance have not, as
yet, been similarly modified to make use of this type of distance matrix.

Since the uncertainty matrix contains information about the error
associated with each measurement, it can be incorporated into the
preprocessing of the data matrix. The more realistic features generated
can be utilized in all reported methods of pattern recognition, thus
eliminating the need to change each analysis method. The scaling
algorithms (SCALE)* in ARTHUR have been modified to include
uncertainties. An error-weighted mean and variance are utilized in place
of the feature mean and variance in these calculations. The new mean of
the jth feature in the data is defined as:


where the $u_{i,j}$'s are the entries in the uncertainty matrix
corresponding to the data matrix measurements $x_{i,j}$, and the sum is
over the training set data vectors. Modifications of the available
distance metrics have also been made, along with the addition of new
distance calculations based on measurement uncertainties. The algorithms
for these can be found in the appendix (DISTANCE) to this chapter. The
modified city-block distance and the modified Mahalanobis distance are
now weighted by a function of the measurement errors associated with the
features going into the calculation. A new metric, the Gaussian
overlap-integral distance, greatly emphasizes the features that have a
small distribution with respect to their measurement size and related
uncertainties. A maximum distance of one is assigned to features that
differ greatly from each other or have very small uncertainties. Another
new distance calculation, the Gaussian feature-space distance, calculates
a distance value that is proportional to the probability that a feature
in the ith data vector belongs to the same population as the
corresponding feature of the jth data vector. These are summed over all
the feature space to give the intersample distance. The calculation is
nonmetric and the distance matrix is unsymmetrical.

*Methods (names in capital letters) are described in the appendix.
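As a rough illustration of error-weighted scaling, the sketch below computes a weighted mean and variance for each feature and uses them to autoscale both the data and the uncertainties. The inverse-variance weights 1/u² are an assumption made for this example; the exact weighting used by SCALE is not recoverable from the text here, and the function names are hypothetical.

```python
import numpy as np

def error_weighted_mean_var(x, u):
    """Weighted mean and variance of each feature.

    x : (n_vectors, n_features) training data matrix
    u : (n_vectors, n_features) uncertainty matrix (one standard deviation)
    The weights 1/u**2 are an assumption, not taken from ARTHUR.
    """
    w = 1.0 / np.square(u)
    mean = np.sum(w * x, axis=0) / np.sum(w, axis=0)
    var = np.sum(w * np.square(x - mean), axis=0) / np.sum(w, axis=0)
    return mean, var

def error_weighted_autoscale(x, u):
    """Scale features to zero weighted mean and unit weighted variance.

    The uncertainties are divided by the same scale factor, since a
    transformation of the data matrix also transforms the uncertainties.
    """
    mean, var = error_weighted_mean_var(x, u)
    std = np.sqrt(var)
    return (x - mean) / std, u / std

x = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 20.0]])
u = np.array([[0.1, 1.0], [0.1, 1.0], [0.1, 5.0]])
mean, var = error_weighted_mean_var(x, u)
print(mean)   # the poorly known value 20.0 barely moves the second mean
```

With equal uncertainties the weighted mean reduces to the ordinary feature mean, so the modification only changes the scaling when some measurements are known less precisely than others.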

The uncertainty matrix has also been incorporated in the
Karhunen-Loève transform. The modified technique transforms the
uncertainties into a new uncertainty matrix along with the sample matrix.
The assumption is made that the same degree of correlation applies to the
uncertainty matrix as is used in the transformation of the sample matrix.
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Introduction  to  Data  Analysis  Using  ARTHUR 

Different collections of objects may have quite different data
structures, varying from a random scatter to well-defined clusters or
curvilinear shapes. Each algorithm effects data reduction according to
the criterion upon which it is based. A thorough understanding of the
assumptions that each technique imposes upon the data structure, and of
the limitations that may result, can therefore provide information
helpful in arriving at an understanding of the underlying structure of
the data when the methods are applied in combination.

Suppose, for example, the n-dimensional structure of two categories
of objects we wish to separate by pattern recognition classification
techniques corresponds to the one-dimensional problem depicted in
Figure 1, where the shaded portions of the figure correspond to category
1 and the unshaded portions to category 2.


Figure 1. Bimodal distribution

Whereas in one dimension the solution to the problem is obvious, in n
dimensions the bimodality may not readily reveal itself. If PLANE or
SIMCA were applied to these data, the results might lead one to believe
that the categories cannot be distinguished, since the data are neither
linearly separable nor continuous. However, KNN would encounter no
problems, since the objects in the near vicinity of a given point tend to
be of its class. Bayesian classification (as long as no a priori
distribution is assumed) would also produce good results. (Note that
plots of the data might expose this distribution in a less ambiguous
form. Consequently, this example is meant only as an illustration of the
effects of the methods on an easy-to-understand distribution.)
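The contrast between a linear discriminant and a nearest-neighbor rule on a bimodal category can be sketched numerically. The data below are hypothetical (one category split into two outer modes, the other lying between them), and the two functions are simple stand-ins written for this example rather than the PLANE or KNN routines of ARTHUR.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D data: category 1 occupies two outer modes,
# category 2 lies between them.
x1 = np.concatenate([rng.normal(-4.0, 0.5, 50), rng.normal(4.0, 0.5, 50)])
x2 = rng.normal(0.0, 0.5, 100)
x = np.concatenate([x1, x2])
y = np.concatenate([np.ones(100, dtype=int), 2 * np.ones(100, dtype=int)])

def best_threshold_accuracy(x, y):
    """Best accuracy of any single cut point (a 1-D linear discriminant)."""
    best = 0.0
    for t in np.sort(x):
        pred = np.where(x > t, 1, 2)
        acc = np.mean(pred == y)
        best = max(best, acc, 1.0 - acc)   # allow either orientation
    return best

def knn_loo_accuracy(x, y, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbor majority vote."""
    d = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(d, np.inf)            # a point may not vote for itself
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y[nearest]
    pred = np.where(np.sum(votes == 1, axis=1) > k / 2, 1, 2)
    return np.mean(pred == y)

print("best single cut:", best_threshold_accuracy(x, y))
print("3-NN leave-one-out:", knn_loo_accuracy(x, y, k=3))
```

No single cut point can place both outer modes on the same side, so the linear rule misclassifies roughly a quarter of these points, while the neighbor vote is unaffected by the bimodality.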

Unfortunately, the solution to a real problem does not, in general,
tend to be as straightforward and may require a great deal of
interaction and guidance from the analyst, aided by preprocessors,
display methods, and statistics. For this reason, the capabilities of
ARTHUR for displaying the data are quite well developed, and their value
when combined with the ingenuity of the analyst will be seen in a later
section of this chapter. On the other hand, since preprocessing refers to
any method that translates, rotates, or in any way transforms the data,
such infinitely diverse possibilities arise that, were we to include only
those methods that we and others have found useful, they would dominate
the code. Therefore, the set of preprocessing tools available in ARTHUR
is aimed mainly toward normalization, feature weighting, and
dimensionality reduction. In addition, ratios of features can be added to
the feature list in TUNE and individual features can be transformed or
combined in CHFEATURE. Since the methods chosen to preprocess the data
can ultimately determine success or failure in the solution of a problem
by pattern recognition methods, as well as the cost of the analysis,
methods not available in ARTHUR should not be neglected. An example of
this is the utilization of the Fourier, Hadamard and autocorrelation
functions for the transformation of spectral data (6, 7).

Two assumptions made throughout any supervised pattern recognition
technique are that the features used contain information useful to the
solution of the problem and that, even when this is known to be the case,
the data can be transformed into a representation amenable to the
algorithms employed. When these assumptions do not hold, it may become
necessary either to change the form of the question being asked about the
existing data or to redesign the experiment from which the measurements
are obtained. Ideally, the information gained through prior analysis will
serve to guide the analyst in this endeavor.

We have discovered that techniques originally designed for
unsupervised learning applications are powerful tools in the early stage
of all data analysis problems. These methods have seen little application
in chemistry. Since the goal of these methods is the determination of
the existence of inherent data structures within a larger data structure,
neither training nor classification is attempted. TREE and HIER are
two unsupervised learning methods which are based on the similarity of
objects as defined by their distances in the feature space. Factor
analysis can also be utilized in this mode.

The  following  sections  are  a brief  description  of  the  basis  of  the 
various  pattern  recognition  methods  used  to  analyze  this  data: 

WEIGHT is a preprocessing method that weights each feature on the basis
of its individual importance to the solution of a pattern recognition
problem. For categorized data, the criterion of importance can be either
the total variance weight or the total Fisher weight for the feature. The
variance weight is a ratio of the interclass variance of two categories
to the intraclass variances of the categories. If $W_{j,m,n}$ is a
measure of the utility of feature j in separating categories m and n, the
variance weight $(WV)_{j,m,n}$ is:

$$
(WV)_{j,m,n} =
\frac{\dfrac{1}{N_m}\sum_{k=1}^{N_m} x_{k,m,j}^{2}
      + \dfrac{1}{N_n}\sum_{k=1}^{N_n} x_{k,n,j}^{2}
      - \dfrac{2}{N_m N_n}\sum_{k=1}^{N_m} x_{k,m,j}\sum_{k=1}^{N_n} x_{k,n,j}}
     {\dfrac{1}{N_m}\sum_{k=1}^{N_m}\left(x_{k,m,j}-\bar{x}_{m,j}\right)^{2}
      + \dfrac{1}{N_n}\sum_{k=1}^{N_n}\left(x_{k,n,j}-\bar{x}_{n,j}\right)^{2}}
$$

where $N_i$ is the number of data vectors in category i, $x_{k,i,j}$ is
the value of feature j for the kth data vector of category i, and
$\bar{x}_{i,j}$ is the mean of feature j in category i; the total
variance weight is the geometric mean of the individual category pair
weights.

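The Fisher weight is a ratio between the squared difference in the
category pair means and the sum of the intraclass variances:

$$
(WF)_{j,m,n} =
\frac{\left(\bar{x}_{m,j}-\bar{x}_{n,j}\right)^{2}}
     {\dfrac{1}{N_m}\sum_{k=1}^{N_m}\left(x_{k,m,j}-\bar{x}_{m,j}\right)^{2}
      + \dfrac{1}{N_n}\sum_{k=1}^{N_n}\left(x_{k,n,j}-\bar{x}_{n,j}\right)^{2}}
$$
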
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The  total  Fisher  weight  is  the  arithmetic  average  of  the  individual 
category  pair  weights. 
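The two category-pair weights can be sketched for a single feature as follows. This is a minimal reading of the prose definitions (interclass over intraclass variance, and squared mean difference over intraclass variance), with intraclass variances taken as sums of squared deviations divided by the category size; the function names are hypothetical and this is not the WEIGHT code itself.

```python
import numpy as np

def variance_weight(xm, xn):
    """Variance weight of one feature for categories m and n (1-D arrays)."""
    inter = np.mean(xm**2) + np.mean(xn**2) - 2.0 * np.mean(xm) * np.mean(xn)
    intra = np.var(xm) + np.var(xn)    # np.var divides by N, as assumed here
    return inter / intra

def fisher_weight(xm, xn):
    """Fisher weight of one feature for categories m and n (1-D arrays)."""
    return (np.mean(xm) - np.mean(xn)) ** 2 / (np.var(xm) + np.var(xn))

def total_weights(x, labels):
    """Total variance weight (geometric mean over category pairs) and
    total Fisher weight (arithmetic mean over category pairs)."""
    cats = np.unique(labels)
    wv, wf = [], []
    for a in range(len(cats)):
        for b in range(a + 1, len(cats)):
            xm, xn = x[labels == cats[a]], x[labels == cats[b]]
            wv.append(variance_weight(xm, xn))
            wf.append(fisher_weight(xm, xn))
    return float(np.exp(np.mean(np.log(wv)))), float(np.mean(wf))

xm = np.array([1.0, 2.0, 3.0])
xn = np.array([4.0, 6.0, 8.0])
print(variance_weight(xm, xn), fisher_weight(xm, xn))
```

Under these definitions the interclass term expands to the sum of the two intraclass variances plus the squared mean difference, so for any category pair the variance weight equals the Fisher weight plus one.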

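For continuous property data the weighting is done on the basis of
the correlation of the feature to the property. The squared correlation
to property of feature j is:

$$
r_j^{2} =
\frac{\left[\sum_{k=1}^{N}\left(x_{j,k}-\bar{x}_j\right)\left(p_k-\bar{p}\right)\right]^{2}}
     {\sum_{k=1}^{N}\left(x_{j,k}-\bar{x}_j\right)^{2}\,
      \sum_{k=1}^{N}\left(p_k-\bar{p}\right)^{2}}
$$
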
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where  N is  the  number  of  data  vectors  in  the  training  set  and  pk  is  the 
property  of  the  kth  data  vector. 
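A correlation-to-property weight can be computed for all features at once. The sketch assumes the weight is the ordinary squared product-moment correlation between each feature column and the property; the function name is hypothetical.

```python
import numpy as np

def correlation_to_property_weights(x, p):
    """Squared correlation of each feature column of x with property p.

    x : (N, n_features) training data matrix
    p : (N,) property values
    """
    xc = x - x.mean(axis=0)
    pc = p - p.mean()
    num = (xc.T @ pc) ** 2
    den = np.sum(xc**2, axis=0) * np.sum(pc**2)
    return num / den

p = np.array([1.0, 2.0, 3.0, 4.0])
x = np.column_stack([2.0 * p + 1.0,                       # exactly linear in p
                     np.array([1.0, -1.0, 1.0, -1.0])])   # weakly related
print(correlation_to_property_weights(x, p))
```

A feature that is an exact linear function of the property receives a weight of one, and the weight falls toward zero as the linear relationship weakens.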

SELECT (28) is a feature selection technique that generates orthogonal
features based on their importance to classification. The criterion of
importance for categorized data is the variance or Fisher weight and, for
continuous-property data, the correlation-to-property weight (see WEIGHT).
The highest weighted feature is selected as the first feature. The
remaining features are then decorrelated from the chosen feature. The
decorrelated features are reweighted and the feature whose new weight is
highest becomes the second selected feature. The process continues until
either a specified number of features has been chosen or a given minimum
weight is attained. The selected (unweighted) features are output to a
file for later use. The user can opt for the decorrelated features or the
same features in their unchanged form. Since one set is a linear
combination of the other set, the same information is retained for either
option; only the representation is changed (i.e., the sub-feature space
is either rotated or not rotated to orthogonal axes).
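The select, decorrelate and reweight loop can be sketched as below. Several choices are assumptions made for illustration: the correlation-to-property weight is used as the criterion, decorrelation is done by projecting the chosen feature out of every remaining feature (a Gram-Schmidt step), and the names `squared_corr` and `select_features` are hypothetical.

```python
import numpy as np

def squared_corr(x, p, eps=1e-12):
    """Squared correlation of each column of x with p (0 if no variance left)."""
    xc = x - x.mean(axis=0)
    pc = p - p.mean()
    num = (xc.T @ pc) ** 2
    den = np.sum(xc**2, axis=0) * np.sum(pc**2)
    w = np.zeros_like(num)
    ok = den > eps
    w[ok] = num[ok] / den[ok]
    return w

def select_features(x, p, n_select, min_weight=0.0):
    """Return indices of selected features and their decorrelated columns."""
    resid = x - x.mean(axis=0)          # features, decorrelated step by step
    chosen, decorrelated = [], []
    while len(chosen) < n_select:
        w = squared_corr(resid, p)
        if chosen:
            w[chosen] = 0.0             # never pick a feature twice
        j = int(np.argmax(w))
        if w[j] <= min_weight:          # stop at the minimum weight
            break
        v = resid[:, j].copy()
        chosen.append(j)
        decorrelated.append(v)
        # Remove from every feature its projection on the chosen feature.
        resid = resid - np.outer(v, v @ resid) / (v @ v)
    if decorrelated:
        return chosen, np.column_stack(decorrelated)
    return chosen, np.empty((x.shape[0], 0))

a = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
b = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
c = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
x = np.column_stack([a, a + 0.1 * c, b])   # second column nearly duplicates the first
p = a + 0.5 * b
chosen, features = select_features(x, p, n_select=2)
print(chosen)
```

In this small example the near-duplicate second column loses almost all of its weight once the first column has been removed from it, so the less correlated third column is selected instead.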

GRAB. As a feature selection method, GRAB (12) is intermediate between
WEIGHT (with no feature decorrelation) and the more expensive SELECT
(with total decorrelation). A previously weighted file of n data vectors
is input to the routine. Each feature is assigned an initial weight

$$
W^{(1)}_i = \left\{ \sum_{k=1}^{n} \left( x_{i,k} - \bar{x}_i \right)^{2} \right\}^{1/2}
$$

The feature with the largest weight is selected as the first new feature.
Each of the remaining features is reweighted such that, if $C_{i,j}$ is
the correlation between the ith feature just chosen and the remaining
feature j,

$$
W^{(2)}_j = W^{(1)}_j \left[ 1 - \left| C_{i,j} \right| \right]
$$

For the mth iteration the weight of the jth feature remaining is

$$
W^{(m)}_j = W^{(1)}_j \prod_{i=1}^{m-1} \left[ 1 - \left| C_{i,j} \right| \right]
$$

where the product runs over the m-1 features already chosen.
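The GRAB reweighting can be sketched as a running product of correlation penalties. The sketch assumes the input matrix has already been weighted, that every feature has nonzero variance, and that the correlations are ordinary product-moment correlations; the function name is hypothetical.

```python
import numpy as np

def grab(x, n_select):
    """Order features by the GRAB reweighting.

    x : (n_vectors, n_features) previously weighted data matrix.
    Returns the indices of the selected features in order of selection.
    """
    xc = x - x.mean(axis=0)
    w1 = np.sqrt(np.sum(xc**2, axis=0))      # initial weights W(1)
    corr = np.corrcoef(x, rowvar=False)      # corr[i, j] between features i, j
    penalty = np.ones(x.shape[1])            # running product of [1 - |C_ij|]
    chosen = []
    for _ in range(min(n_select, x.shape[1])):
        w = w1 * penalty
        if chosen:
            w[chosen] = -np.inf              # exclude features already chosen
        j = int(np.argmax(w))
        chosen.append(j)
        penalty = penalty * (1.0 - np.abs(corr[j]))
    return chosen

a = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
b = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
c = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
x = np.column_stack([3.0 * a, 2.0 * (a + 0.1 * c), b])
print(grab(x, 3))
```

The second column has the second-largest initial weight, but because it is almost perfectly correlated with the first column its weight collapses after the first selection and the uncorrelated third column is taken next.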

LEAST performs a least-squares multi-linear regression that is best suited
to continuous property problems. If D is a data matrix with associated
property matrix P, then $W = (D^{T}D)^{-1}D^{T}P$ is the least-squares
solution to the set of linear equations $P = DW$, where W is a vector
which weights the utility of the features in fitting the data.

In actual practice, determination of the weight vector is done by

$$
W = \left[ E^{T} C^{-1} E \right] X^{T} P
$$

where X is obtained by mean normalization of D, $C^{-1}$ is the inverted
correlation matrix associated with D and E is a diagonal matrix whose
elements are the reciprocal variances of the features.

Prediction of an unknown property P' based on the weight vector obtained
is therefore

$$
P' = X'W
$$
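A least-squares fit of this kind can be sketched with a standard solver. The sketch does not reproduce the correlation-matrix route described for LEAST: it solves the mean-normalized problem directly with `numpy.linalg.lstsq`, and it also centers the property, which is an assumption. The function names are hypothetical.

```python
import numpy as np

def least_fit(d, p):
    """Least-squares weights for P = D W on mean-normalized data."""
    d_mean, p_mean = d.mean(axis=0), p.mean()
    w, *_ = np.linalg.lstsq(d - d_mean, p - p_mean, rcond=None)
    return w, d_mean, p_mean

def least_predict(d_new, w, d_mean, p_mean):
    """Predict the property of new data vectors: P' = X' W (plus the mean)."""
    return (d_new - d_mean) @ w + p_mean

d = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
p = 2.0 * d[:, 0] - d[:, 1] + 3.0       # property that is exactly linear
w, d_mean, p_mean = least_fit(d, p)
print(w, least_predict(np.array([[5.0, 5.0]]), w, d_mean, p_mean))
```

Solving the system with a least-squares routine avoids forming and inverting a matrix explicitly, which matters when features are strongly correlated.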

LEDISC is a multi-linear least-squares regression designed for categorized
data. Except in property definitions it is computationally equivalent to
LEAST. For a data set of n categories, n linear regressions are performed
such that for the ith regression the property P is defined as

$$
P = \begin{cases}
+1 & \text{for all vectors in category } i \\
\phantom{+}0 & \text{for all vectors not in category } i
\end{cases}
$$

An unknown data vector is placed into that class whose weight vector
produces the largest value.
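The one-regression-per-category scheme can be sketched by fitting all indicator properties in a single least-squares call. The appended column of ones (an intercept) and the function names are assumptions made for this example.

```python
import numpy as np

def ledisc_fit(x, labels):
    """Fit one least-squares weight vector per category (+1 / 0 targets)."""
    cats = np.unique(labels)
    xa = np.column_stack([x, np.ones(len(x))])        # intercept: an assumption
    targets = (labels[:, None] == cats[None, :]).astype(float)
    w, *_ = np.linalg.lstsq(xa, targets, rcond=None)  # one column per category
    return cats, w

def ledisc_predict(x, cats, w):
    """Assign each vector to the category whose regression gives the largest value."""
    xa = np.column_stack([x, np.ones(len(x))])
    return cats[np.argmax(xa @ w, axis=1)]

x = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [5.0, 0.0], [5.5, 0.0], [5.0, 0.5],
              [0.0, 5.0], [0.5, 5.0], [0.0, 5.5]])
labels = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
cats, w = ledisc_fit(x, labels)
print(ledisc_predict(np.array([[4.8, 0.2]]), cats, w))
```

The three hypothetical categories here sit at the corners of a triangle; categories arranged along a line can mask one another under this scheme, which is a known weakness of regression on indicator targets.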

LESLT is a variable reduction technique which seeks to optimize category
pair separation in as few variables as possible. Each derived feature is
a linear combination of the original data that describes the position of
a data vector relative to a hyperplane between two categories in the data
set. The input data matrix (X) of n categories is divided into n(n-1)/2
submatrices. If Y is the submatrix containing only those patterns in
categories i and j plus the test data, an outcome column matrix of
properties can be defined such that

$$
G^{i,j} = \begin{cases}
-1 & \text{for patterns in } i \\
+1 & \text{for patterns in } j
\end{cases}
$$

Thus defined, there exists a vector of weights $W^{k}$ such that
$YW^{k} = G^{k}$, where k indexes the category pair. (Determination of
$W^{k}$ is the least-squares solution of this equation; see LEAST.) The
weight vector obtained is used to transform and classify all the data
vectors in Y. This process is followed for all category pairs. Once all
the weight vectors are obtained, the entire data matrix (X) is
transformed such that X' = XW. The new matrix X' obtained has n(n-1)/2
features which are approximate category-pair separators.

LEPIECE does a piece-wise least-squares multiple regression for each
data vector in the training and test set. The property of each data
vector is predicted from the fit (see LEAST) using the k-nearest-neighbors
(see KNN) to the vectors. The value of k is a user-defined multiple of
the number of features. The criterion used for "nearest" is the
interpattern distance (see DISTANCE). Only those features used in the
determination of the distance are used in the regression.

MULTI is a hyperplane discriminant function method designed for
multicategory data. Computationally, it is equivalent to PLANE, except in
category definition. For a data matrix of n categories, n hyperplanes
are generated such that the ith hyperplane describes the separation of
the ith category from the rest of the data.

PLANE  generates  and  classifies  on  the  basis  of  a linear  discriminant 
function  and  is  best  suited  to  data  containing  two  categories  (see  MULTI 
for  multicategory  case).  By  an  error-correction  feedback  method  it  seeks 
a hyperplane  in  an  augmented  n+1  space  (where  n is  the  number  of  features) 
that  best  separates  a pair  of  categories. 

Each data vector in n space is considered a vector in n+1 space
where the (n+1)th feature is unity. Therefore, two classes can be defined
as lying on either side of a hyperplane (whose equation in n+1 space is
$W \cdot Y = 0$) through the origin, with corresponding class numbers +1 and -1.

The  discriminant  function  is  calculated  by  first  loading  a weight  vector 
with  random  or  user-defined  values.  During  training,  classification  of 
a vector by this weight vector is a decision of the form

\[ W \cdot Y_k = S_k \]

The classification of pattern k is correct if the sign of the response
$S_k$ relative to the hyperplane is the same as the sign of its class,
and incorrect if the sign is not the same.

If a pattern is misclassified, the weight vector is adjusted by
reflection of the hyperplane about the misclassified point. The new weight
vector is then used to classify the data. The process continues until
all patterns in the training set are correctly classified or a maximum
number of iterations is reached.
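
A minimal sketch of this error-correction loop is given below. It assumes the reflection step takes the usual learning-machine form W' = W + cY with c = -2S/(Y·Y), which makes the new response exactly -S. The source does not state the formula, so this is an assumption, as are the function names and the fallback used when the response is exactly zero.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def reflect(w, y):
    """Reflect the hyperplane about pattern y so that its response changes sign."""
    s = dot(w, y)
    c = -2.0 * s / dot(y, y)  # assumed correction increment; new response is -s
    return [wi + c * yi for wi, yi in zip(w, y)]


def train_plane(patterns, classes, w, max_iter=100):
    """Error-correction feedback training; returns (weights, converged).

    patterns: feature vectors in n space; classes: +1 or -1 for each pattern;
    w: initial weight vector of length n+1 (random or user-defined).
    """
    aug = [list(p) + [1.0] for p in patterns]  # (n+1)th feature is unity
    w = list(w)
    for _ in range(max_iter):
        errors = 0
        for y, cls in zip(aug, classes):
            s = dot(w, y)
            if s * cls <= 0:  # sign of response disagrees with the class
                errors += 1
                if s != 0:
                    w = reflect(w, y)
                else:  # zero response: nudge toward the class (an assumption)
                    w = [wi + cls * yi for wi, yi in zip(w, y)]
        if errors == 0:
            return w, True
    return w, False
```

The loop stops either when a full pass over the training set produces no misclassifications or when `max_iter` passes have been made, mirroring the two termination conditions in the text.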

For more than two categories, a hyperplane separating each pair of
categories is found. An unknown data vector is then classified using a
majority committee vote procedure on all the discriminant function
responses. The use of PLANE for multicategory data is equivalent to a
piece-wise learning machine.
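
The committee vote can be illustrated with a short sketch. The function name `committee_vote` and the convention that a positive response votes for the first category of a pair are assumptions made for this example. Ties are resolved by whichever category was counted first, which the source does not specify.

```python
from collections import Counter


def committee_vote(pair_weights, vector):
    """Classify a vector by majority vote over all pairwise hyperplanes.

    pair_weights maps a category pair (cat_pos, cat_neg) to a weight vector of
    length n+1; a positive response votes for cat_pos, otherwise for cat_neg.
    """
    y = list(vector) + [1.0]  # augment with the unity feature
    votes = Counter()
    for (cat_pos, cat_neg), w in pair_weights.items():
        s = sum(wi * yi for wi, yi in zip(w, y))
        votes[cat_pos if s > 0 else cat_neg] += 1
    return votes.most_common(1)[0][0]
```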

REGRESS is a multidimensional multivariate regression method which
computes a linear discriminant function. It accepts both category and
continuous data. Two optimization methods are available: either the
residual variance can be minimized or the multiple correlation maximized.

STEP  is  a stepwise  multi-linear  regression  method.  Features  used  in  the 
regression  are  determined  by  their  contribution  to  the  overall  variance. 

In  the  regression,  features  are  added  one  at  a time  such  that  the  feature 
that  is  added  makes  the  greatest  improvement  in  the  "goodness  of  fit." 

When a feature that is indicated to be significant to the reduction in
variance in an early stage of the regression is indicated to be
insignificant after the addition of several other features, it is eliminated
from the regression before addition of another feature. The criteria
for selecting a feature to add to or remove from the calculation are as
follows:

Removal: If the variance contribution is insignificant at a specified
F-level, the feature is removed from the regression.

Addition: If the variance reduction due to addition of a feature is
significant at a specified F-level, this feature is entered into the
regression.
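
The add/remove logic can be sketched as below. The text does not say how the F value is computed, so the sketch assumes the standard partial F statistic for one feature. The names `partial_f` and `next_step`, and the two thresholds `f_enter` and `f_remove`, are hypothetical.

```python
def partial_f(rss_without, rss_with, n_samples, n_features_with):
    """Partial F statistic for the variance reduction due to a single feature.

    rss_without / rss_with: residual sums of squares of the regression without
    and with the feature; an intercept is assumed in the degrees of freedom.
    """
    dof = n_samples - n_features_with - 1
    return (rss_without - rss_with) / (rss_with / dof)


def next_step(f_in_model, f_candidates, f_enter, f_remove):
    """Decide the next stepwise action from the current F values.

    f_in_model: feature -> F value of its contribution within the regression;
    f_candidates: feature -> F value it would have if added.
    Removal is checked first, so an insignificant feature is eliminated
    before another feature is added.
    """
    if f_in_model:
        worst = min(f_in_model, key=f_in_model.get)
        if f_in_model[worst] < f_remove:
            return ("remove", worst)
    if f_candidates:
        best = max(f_candidates, key=f_candidates.get)
        if f_candidates[best] >= f_enter:
            return ("add", best)
    return ("stop", None)
```

Choosing `f_remove` no larger than `f_enter` is a common precaution that keeps a feature from being added and removed in alternation.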

HIER is an unsupervised learning (cluster analysis) method based on the
relative similarity of a set of data vectors. Each vector is initially
assumed to be a lone cluster. A similarity matrix is constructed such
that if $S_{ij}$ is the similarity between the ith and jth data vectors, then

\[ S_{ij} = 1 - \frac{d_{ij}}{d_{max}} \]

where $d_{ij}/d_{max}$ is the interpattern distance of data vectors i and j
normalized by the largest interpattern distance $d_{max}$ in the data (see
DISTANCE).

The matrix is scanned for the maximum similarity in the set. These
"most similar" vectors are clustered, removed from the matrix and
replaced by a new vector whose location is the average of the two vectors.
In  combining  clusters,  two  options  are  available.  Either  the  average 
of  the  two  clusters  is  weighted  by  the  number  of  data  vectors  in  each 
cluster  or  each  cluster  is  given  equal  weight.  The  new  matrix  is  scanned 
for  the  next  greatest  similarity  and  the  procedure  is  repeated.  The 
process  ends  when  all  the  data  vectors  form  a single  cluster.  Output 
is  in  the  form  of  a connection  dendrogram. 
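
The clustering loop can be sketched as follows. This is an illustration under stated assumptions: the interpattern distance is taken to be Euclidean (the DISTANCE entry is not reproduced here), similarity is computed as 1 - d/d_max, and the function name `hier` and its return format are hypothetical. The merge history it returns is the information a dendrogram would be drawn from.

```python
import math


def hier(vectors, weighted=True):
    """Agglomerative clustering; returns merges as (members_a, members_b, similarity).

    weighted=True averages two clusters in proportion to their number of data
    vectors; weighted=False gives each cluster equal weight.
    """
    # Each vector starts as a lone cluster: (member indices, location).
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    # Largest interpattern distance in the original data (must be nonzero).
    d_max = max(math.dist(a, b) for a in vectors for b in vectors)
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = 1.0 - math.dist(clusters[p][1], clusters[q][1]) / d_max
                if best is None or s > best[0]:
                    best = (s, p, q)
        s, p, q = best
        (ma, ca), (mb, cb) = clusters[p], clusters[q]
        wa, wb = (len(ma), len(mb)) if weighted else (1, 1)
        centre = [(wa * x + wb * y) / (wa + wb) for x, y in zip(ca, cb)]
        merges.append((ma, mb, s))
        # Remove the two merged clusters and add their replacement.
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append((ma + mb, centre))
    return merges
```

For the one-dimensional vectors 0, 1 and 10, the first merge joins the vectors at 0 and 1, and the second joins that cluster with the vector at 10, at which point all data vectors form a single cluster.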